(57) Abstract

An internal RISC-type instruction structure furnishes a fixed bit-length template including a plurality of defined bit fields for a plurality
of operation (Op) formats. One format includes an instruction-type bit field, two source-operand bit fields and one destination-operand bit
field for designating a register-to-register operation. Another format is a load-store format that includes an instruction-type bit field, an
identifier of a source or destination register for the respective load or store operation, and bit fields for specifying the segment, base and
index parameters of an address.

WO 97/13194 PCT/US96/15422

RISC86 INSTRUCTION SET

TECHNICAL FIELD

This invention relates to processors. More specifically, this invention relates to processors that convert
x86 instructions into RISC-type operations for execution on a RISC-type core.



BACKGROUND ART 

Advanced microprocessors, such as P6-class x86 processors, are defined by a common set of features.
These features include a superscalar architecture and performance, decoding of multiple x86 instructions per
cycle and conversion of the multiple x86 instructions into RISC-like operations. The RISC-like operations are
executed out-of-order in a RISC-type core that is decoupled from decoding. These advanced microprocessors
support large instruction windows for reordering instructions and for reordering memory references.

The conversion or translation of CISC-like instructions into RISC-like instructions typically involves a
mapping of many like-type instructions, for example many variations of an ADD instruction, into one or more
generic instructions. Furthermore, a very large number of addressing modes of a particular instruction are
converted into a register-to-register operation in combination with load and store operations accessing memory.
Performance of a microprocessor that converts CISC-type instructions into RISC-type instructions for execution
depends greatly on the number of RISC-type operations produced from a single CISC-type instruction.

Performance of an advanced superscalar microprocessor is also highly dependent upon decoding
performance including decoding speed, the number of x86 instructions decoded in a single cycle, and branch
prediction performance and handling. Instruction decoders in advanced superscalar microprocessors often
include one or more decoding pathways in which x86 instructions are decoded by hardware logic translation and
a separate decoding pathway which uses a ROM memory for fetching a RISC operation sequence that
corresponds to an x86 instruction. Generally, x86 instructions that are translated by hardware logic are simple
x86 instructions. The lookup ROM is used to decode more complex x86 instructions.

One problem with the usage of lookup ROM for decoding x86 instructions is that the process of
accessing a microprogram control store is inherently slower and less efficient than hardwired translation of
instructions. A further problem arising with decoding of x86 instructions via lookup ROM is the very large
number of different CISC-type instructions that are standard for an x86 processor. Since a substantial number
of instructions are implemented, a large ROM circuit on the processor chip is necessary for converting the
instructions to RISC-like operations. The large number of implemented instructions corresponds to an increased
circuit complexity for deriving pointers to the ROM and applying the derived pointers to the ROM. This
increased circuit complexity directly relates to an increased overhead that reduces instruction decoding
throughput. A large lookup ROM increases the size of the processor integrated circuit and thereby reduces
manufacturing yields of the circuits and increases production costs.

What is needed is an internal instruction format that facilitates translation of a very large number of
CISC-type instructions into a small number of RISC-type operations. What is also needed is an internal
instruction format that facilitates conversion of CISC-type instructions into a minimum number of RISC-type
operations. What is further needed is an internal instruction format that permits more common CISC-type
instructions to be converted using hardwired logic, as compared to conversion via lookup ROM.

DISCLOSURE OF THE INVENTION

In accordance with the present invention, an internal RISC-type instruction structure furnishes a fixed
bit-length template including a plurality of defined bit fields for a plurality of operation (Op) formats. One
format includes an instruction-type bit field, two source-operand bit fields and one destination-operand bit field
for designating a register-to-register operation. Another format is a load-store format that includes an
instruction-type bit field, an identifier of a source or destination register for the respective load or store
operation, and bit fields for specifying the segment, base and index parameters of an address, an index scaling
factor and a displacement.

In accordance with a first embodiment of the present invention, a RISC-like internal instruction set
executes on a RISC-like core of a superscalar processor. The RISC-like internal instruction set is translated
from instructions of a CISC-like external instruction set. The instruction set includes a plurality of instruction
codes arranged in a fixed bit-length structure. The structure is divided into a plurality of defined-usage bit
fields. The instruction codes have a plurality of operation (Op) formats of the plurality of defined-usage bit
fields. One Op format is a first register Op format having a format bit-field designating the instruction code as
a first register Op format code, a type bit-field designating a register instruction type, a first source operand
bit-field identifying a first source operand, a second source operand bit-field identifying a second source operand, a
destination bit-field designating a destination operand, and an operand size bit-field designating an operand
byte-size. A second Op format is a load-store Op format having a format bit-field designating the instruction code as
a load-store Op format code, a type bit-field designating a load-store instruction type, a data bit-field identifying
a destination-source of a load-store operation, an index scale factor bit-field for designating an index scale
factor, a segment bit-field designating a segment register, a base bit-field designating a load-store base address,
a displacement bit-field designating a load-store address displacement, and an index bit-field designating a
load-store address index. Various bit-fields of the instruction code are substituted into the Op format with the
substituted bit-field determined as a function of processor context.
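
As an illustration of a fixed bit-length template divided into defined-usage bit fields, the following Python sketch packs and unpacks a register Op code and a load-store Op code. The 32-bit word length, the field widths and ordering, and all names (`REGOP_FIELDS`, `LDSTOP_FIELDS`, `pack`, `unpack`) are assumptions chosen for the example. They are not taken from the Op formats of Figures 6A through 6E.

```python
# Hypothetical field layouts: (name, width in bits), listed from the most
# significant field to the least significant field. Widths are assumptions.
REGOP_FIELDS = [
    ("format", 2),   # designates the code as a register Op format code
    ("type", 6),     # register instruction type
    ("src1", 5),     # first source operand
    ("src2", 5),     # second source operand
    ("dest", 5),     # destination operand
    ("dsz", 3),      # operand byte-size
    ("unused", 6),   # padding up to the fixed length
]

LDSTOP_FIELDS = [
    ("format", 2),   # designates the code as a load-store Op format code
    ("type", 4),     # load-store instruction type
    ("data", 5),     # destination (load) or source (store) register
    ("scale", 2),    # index scale factor
    ("seg", 4),      # segment register
    ("base", 5),     # load-store base address register
    ("disp", 5),     # load-store address displacement
    ("index", 5),    # load-store address index register
]

WORD_BITS = 32  # assumed fixed bit length shared by every Op format


def pack(fields, values):
    """Pack named field values into one fixed-length integer code."""
    assert sum(width for _, width in fields) == WORD_BITS
    code = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        code = (code << width) | value
    return code


def unpack(fields, code):
    """Split a fixed-length integer code back into its named fields."""
    values = {}
    shift = WORD_BITS
    for name, width in fields:
        shift -= width
        values[name] = (code >> shift) & ((1 << width) - 1)
    return values
```

Because both layouts sum to the same word length, a decoder in this sketch can read the `format` field first and then select the layout used to interpret the remaining bits.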

Many advantages are gained by the described instruction set and instruction set format. One substantial
advantage is that the described instruction set is highly regular with a fixed length and format of RISC
operations. By comparison, conventional x86 instructions are highly irregular having greatly different
instruction lengths and formats. A regular structure greatly reduces circuit complexity and size and substantially
increases efficiency. A further advantage is that extensions are included for efficient mapping of operations by
the decoder. The extensions support direct decoding and emulation or mapping of x86 instructions. Another
advantage is that the described instruction set is implemented in a reduced ROM memory size due to reuse of
operation structures for multiple variations that are common among x86 instructions. Usage of substitution
achieves encoding of CISC functionality while substantially reducing the size of lookup code ROM,
advantageously reducing the size and cost of a processor integrated circuit.

BRIEF DESCRIPTION OF THE DRAWINGS

The features of the invention believed to be novel are specifically set forth in the appended claims.
However, the invention itself, both as to its structure and method of operation, may best be understood by
referring to the following description and accompanying drawings.

Figure 1 is a block diagram which illustrates a computer system in accordance with one embodiment of
the present invention.

Figure 2 is a block diagram illustrating one embodiment of a processor for usage in the computer system
shown in Figure 1.

Figure 3 is a timing diagram which illustrates pipeline timing for an embodiment of the processor
shown in Figure 2.

Figure 4 is a schematic block diagram showing an embodiment of an instruction decoder used in the
processor shown in Figure 2.

Figure 5 is a schematic block diagram which depicts a structure of an emulation code sequencer and an
emulation code memory of the instruction decoder shown in Figure 4.

Figures 6A through 6E are pictorial illustrations showing a plurality of operation (Op) formats
generated by the instruction decoder shown in Figure 4.

Figure 7 is a pictorial depiction of an OpSeq field format employed in the emulation code memory
shown in Figure 5.

Figure 8 is a pictorial depiction of an exemplary instruction substitution.

Figure 9 is a schematic block diagram illustrating an exemplary embodiment of a scheduler.

Figure 10 is a block diagram of a personal computer incorporating a processor having an instruction
decoder which implements a RISC86 instruction set in accordance with an embodiment of the present invention.

MODE(S) FOR CARRYING OUT THE INVENTION

Referring to Figure 1, a computer system 100 is used in a variety of applications, including a personal
computer application. The computer system 100 includes a computer motherboard 110 containing a processor
120 in accordance with an embodiment of the invention. Processor 120 is a monolithic integrated circuit which
executes a complex instruction set so that the processor 120 may be termed a complex instruction set computer
(CISC). Examples of complex instruction sets are the x86 instruction sets implemented on the well known 8086
family of microprocessors. The processor 120 is connected to a level 2 (L2) cache 122, a memory controller
124 and local bus controllers 126 and 128. The memory controller 124 is connected to a main memory 130 so
that the memory controller 124 forms an interface between the processor 120 and the main memory 130. The
local bus controllers 126 and 128 are connected to buses including a PCI bus 132 and an ISA bus 134 so that
the local bus controllers 126 and 128 form interfaces between the PCI bus 132 and the ISA bus 134.

Referring to Figure 2, a block diagram of an embodiment of processor 120 is shown. The core of the
processor 120 is a RISC superscalar processing engine. Common x86 instructions are converted by instruction
decode hardware to operations in an internal RISC86 instruction set. Other x86 instructions, exception
processing, and other miscellaneous functionality are implemented as RISC86 operation sequences stored in
on-chip ROM. Processor 120 has interfaces including a system interface 210 and an L2 cache control logic 212.
The system interface 210 connects the processor 120 to other blocks of the computer system 100. The
processor 120 accesses the address space of the computer system 100, including the main memory 130 and
devices on local buses 132 and 134 by read and write accesses via the system interface 210. The L2 cache
control logic 212 forms an interface between an external cache, such as the L2 cache 122, and the processor
120. Specifically, the L2 cache control logic 212 interfaces the L2 cache 122 to an instruction cache 214
and a data cache 216 in the processor 120. The instruction cache 214 and the data cache 216 are level 1 (L1)
caches which are connected through the L2 cache 122 to the address space of the computer system 100.

Instructions from main memory 130 are loaded into instruction cache 214 via a predecoder 270 for
anticipated execution. The predecoder 270 generates predecode bits that are stored in combination with
instruction bits in the instruction cache 214. The predecode bits, for example 3 bits, are fetched along with an
associated instruction byte (8 bits) and used to facilitate multiple instruction decoding and reduce decode time.
Instruction bytes are loaded into instruction cache 214 thirty-two bytes at a time as a burst transfer of four
eight-byte quantities. Logic of the predecoder 270 is replicated eight times for usage four times in a cache line so
that predecode bits for all eight instruction bytes are calculated simultaneously immediately before being written
into the instruction cache 214. A predecode operation on a byte typically is based on information in one, two
or three bytes so that predecode information may extend beyond an eight-byte group. Accordingly, the latter
two bytes of an eight-byte group are saved for processing with the next eight-byte group in case of predecode
information that overlaps two eight-byte groups. Instructions in instruction cache 214 are CISC instructions,
referred to as macroinstructions. An instruction decoder 220 converts CISC instructions from instruction cache
214 into operations of a reduced instruction set computing (RISC) architecture instruction set for execution on
an execution engine 222. A single macroinstruction from instruction cache 214 decodes into one or multiple
operations for execution engine 222.

Instruction decoder 220 has interface connections to the instruction cache 214 and an instruction fetch
control circuit 218 (shown in Figure 4). Instruction decoder 220 includes a macroinstruction decoder 230 for
decoding most macroinstructions, an instruction emulation circuit 231 including an emulation ROM 232 for
decoding a subset of instructions such as complex instructions, and a branch unit 234 for branch prediction and
handling. Macroinstructions are classified according to the general type of operations into which the
macroinstructions are converted. The general types of operations are register operations (RegOps), load-store
operations (LdStOps), load immediate value operations (LIMMOps), special operations (SpecOps) and floating
point operations (FpOps).

Execution engine 222 has a scheduler 260 and six execution units including a load unit 240, a store unit
242, a first register unit 244, a second register unit 246, a floating point unit 248 and a multimedia unit 250.
The scheduler 260 distributes operations to appropriate execution units and the execution units operate in
parallel. Each execution unit executes a particular type of operation. In particular, the load unit 240 and the
store unit 242 respectively load (read) data or store (write) data to the data cache 216 (L1 data cache), the L2
cache 122 and the main memory 130 while executing a load/store operation (LdStOp). A store queue 262
temporarily stores data from store unit 242 so that store unit 242 and load unit 240 operate in parallel without
conflicting accesses to data cache 216. Register units 244 and 246 execute register operations (RegOps) for
accessing a register file 290. Floating point unit 248 executes floating point operations (FpOps). Multimedia
unit 250 executes arithmetic operations for multimedia applications.

Scheduler 260 is partitioned into a plurality of, for example, 24 entries where each entry contains
storage and logic. The 24 entries are grouped into six groups of four entries, called Op quads. Information in
the storage of an entry describes an operation for execution, whether or not the execution is pending or
completed. The scheduler monitors the entries and dispatches information from the entries to
information-designated execution units.
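
The grouping of 24 entries into six Op quads can be pictured with a short Python sketch. The class and method names (`Entry`, `Scheduler`, `load_quad`, `issue`, `retire`), the oldest-first issue order and the rule that a quad leaves only after all four of its entries complete are assumptions made for illustration. The description above states only that entries hold storage and logic and that information is dispatched to designated execution units.

```python
from collections import deque
from dataclasses import dataclass

QUAD_SIZE = 4    # entries per Op quad
NUM_QUADS = 6    # six groups of four entries give the 24 entries in the text


@dataclass
class Entry:
    op: str                  # operation described by the entry's storage
    unit: str                # execution unit designated by the entry
    issued: bool = False
    completed: bool = False


class Scheduler:
    def __init__(self):
        self.quads = deque()   # oldest Op quad is at the left

    def load_quad(self, ops):
        """Load one Op quad of (op, unit) pairs; return False when full."""
        if len(ops) != QUAD_SIZE:
            raise ValueError("an Op quad holds exactly four entries")
        if len(self.quads) == NUM_QUADS:
            return False
        self.quads.append([Entry(op, unit) for op, unit in ops])
        return True

    def issue(self, unit):
        """Issue the oldest not-yet-issued operation designated for unit."""
        for quad in self.quads:
            for entry in quad:
                if entry.unit == unit and not entry.issued:
                    entry.issued = True
                    return entry
        return None

    def retire(self):
        """Remove the oldest quad once all four of its entries completed."""
        if self.quads and all(e.completed for e in self.quads[0]):
            return self.quads.popleft()
        return None
```

In this sketch each execution unit calls `issue` with its own unit name, so several units can draw operations from the same Op quad in one pass.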

Referring to Figure 3, processor 120 employs five and six stage basic pipeline timing. Instruction
decoder 220 decodes two instructions in a single clock cycle. During a first stage 310, the instruction fetch
control circuit 218 fetches CISC instructions into instruction cache 214. Predecoding of the CISC instructions
during stage 310 reduces subsequent decode time. During a second stage 320, instruction decoder 220 decodes
instructions from instruction cache 214 and loads an Op quad into scheduler 260. During a third stage 330,
scheduler 260 scans the entries and issues operations to corresponding execution units 240 to 252 if an operation
for the respective types of execution units is available. Operands for the operations issued during stage 330 are
forwarded to the execution units in a fourth stage 340. For a RegOp, the operation generally completes in the
next clock cycle which is stage 350, but LdStOps require more time for address calculation 352, data access and
transfer of the results 362.

For branch operations, instruction decoder 220 performs a branch prediction 324 during an initial
decoding of a branch operation. A branch unit 252 evaluates conditions for the branch at a later stage 364 to
determine whether the branch prediction 324 was correct. A two level branch prediction algorithm predicts a
direction of conditional branching, and fetching CISC instructions in stage 310 and decoding the CISC
instructions in stage 320 continues in the predicted branch direction. Scheduler 260 determines when all
condition codes required for branch evaluation are valid, and directs the branch unit 252 to evaluate the branch
instruction. If a branch was incorrectly predicted, operations in the scheduler 260 which should not be executed
are flushed and decoder 220 begins loading new Op quads from the correct address after the branch. A time
penalty is incurred as instructions for the correct branching are fetched. Instruction decoder 220 either reads a
previously-stored predicted address or calculates an address using a set of parallel adders. If a
previously-predicted address is stored, the predicted address is fetched in stage 326 and instructions located at the predicted
address are fetched in stage 328 without a delay for adders. Otherwise, parallel adders calculate the predicted
address.

In branch evaluation stage 364, branch unit 252 determines whether the predicted branch direction is
correct. If a predicted branch direction is correct, the fetching, decoding, and instruction-executing steps
continue without interruption. For an incorrect prediction, scheduler 260 is flushed and instruction decoder 220
begins decoding macroinstructions from the correct program counter subsequent to the branch.
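
The description names a two level branch prediction algorithm without giving its structure. The following Python sketch shows one common textbook form, a global history register selecting among 2-bit saturating counters, purely as an illustration of the predict, evaluate and update cycle. The class name `TwoLevelPredictor`, the four-bit history length and the counter encoding are assumptions and are not asserted to match the predictor of processor 120.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor with 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                           # first level: recent outcomes
        self.counters = [1] * (1 << history_bits)  # second level: weakly not-taken

    def predict(self):
        """Predict taken when the selected counter is in state 2 or 3."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train on the evaluated outcome; return True on a misprediction."""
        mispredicted = self.predict() != taken
        count = self.counters[self.history]
        if taken:
            self.counters[self.history] = min(3, count + 1)
        else:
            self.counters[self.history] = max(0, count - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
        return mispredicted
```

A `True` return from `update` corresponds to the case described above in which the scheduler is flushed and decoding restarts from the correct program counter.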

Referring to Figure 4, a schematic block diagram illustrates an embodiment of an instruction
preparation circuit 400 which is connected to the main memory 130. The instruction preparation circuit 400
includes the instruction cache 214 that is connected to the main memory 130 via the predecoder 270. The
instruction decoder 220 is connected to receive instruction bytes and predecode bits from three alternative
sources, the instruction cache 214, a branch target buffer (BTB) 456 and an instruction buffer 408. The
instruction bytes and predecode bits are supplied to the instruction decoder 220 through a plurality of rotators
430, 432 and 434 via instruction registers 450, 452 and 454. The macroinstruction decoder 230 has input
connections to the instruction cache 214 and instruction fetch control circuit 218 for receiving instruction bytes
and associated predecode information. The macroinstruction decoder 230 buffers fetched instruction bytes in an
instruction buffer 408 connected to the instruction fetch control circuit 218. The instruction buffer 408 is a
sixteen byte buffer which receives and buffers up to 16 bytes or four aligned words from the instruction cache
214, loading as much data as allowed by the amount of free space in the instruction buffer 408. The instruction
buffer 408 holds the next instruction bytes to be decoded and continuously reloads with new instruction bytes as
old ones are processed by the macroinstruction decoder 230. Instructions in both the instruction cache 214 and
the instruction buffer 408 are held in "extended" bytes, containing both memory bits (8) and predecode bits (5),
and are held in the same alignment. The predecode bits assist the macroinstruction decoder 230 to perform
multiple instruction decodes within a single clock cycle.

Instruction bytes addressed using a decode program counter (PC) 420, 422 or 424, are transferred from
the instruction buffer 408 to the macroinstruction decoder 230. The instruction buffer 408 is accessed on a byte
basis by decoders in the macroinstruction decoder 230. However on each decode cycle, the instruction buffer
408 is managed on a word basis for tracking which of the bytes in the instruction buffer 408 are valid and
which are to be reloaded with new bytes from the instruction cache 214. The designation of whether an
instruction byte is valid is maintained as the instruction byte is decoded. For an invalid instruction byte,
decoder invalidation logic (not shown), which is connected to the macroinstruction decoder 230, sets a "byte
invalid" signal. Control of updating of the current fetch PC 426 is synchronized closely with the validity of
instruction bytes in the instruction buffer 408 and the consumption of the instruction bytes by the instruction
decoder 220.

The macroinstruction decoder 230 receives up to sixteen bytes or four aligned words of instruction
bytes fetched from the instruction fetch control circuit 218 at the end of a fetch cycle. Instruction bytes from
the instruction cache 214 are loaded into a 16-byte instruction buffer 408. The instruction buffer 408 buffers
instruction bytes, plus predecode information associated with each of the instruction bytes, as the instruction
bytes are fetched and/or decoded. The instruction buffer 408 receives as many instruction bytes as can be
accommodated by the instruction buffer 408 free space, holds the next instruction bytes to be decoded and
continually reloads with new instruction bytes as previous instruction bytes are transferred to individual decoders
within the macroinstruction decoder 230. The instruction predecoder 270 adds predecode information bits to
the instruction bytes as the instruction bytes are transferred to the instruction cache 214. Therefore, the
instruction bytes stored and transferred by the instruction cache 214 are called extended bytes. Each extended
byte includes eight memory bits plus five predecode bits. The five predecode bits include three bits that encode
instruction length, one D-bit that designates whether the instruction length is D-bit dependent, and a
HasModRM bit that indicates whether an instruction code includes a modrm field. The thirteen bits are stored
in the instruction buffer 408 and passed on to the macroinstruction decoder 230 decoders. The instruction buffer
408 expands each set of five predecode bits into six predecode bits. Predecode bits enable the decoders to
quickly perform multiple instruction decodes within one clock cycle.
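
The thirteen-bit extended byte can be modelled as eight memory bits plus the five predecode bits listed above. In the Python sketch below, the names `ExtendedByte`, `pack_extended` and `unpack_extended` and the ordering of the bits within the thirteen-bit word are assumptions made for illustration. The description specifies which bits exist, not where they sit.

```python
from typing import NamedTuple


class ExtendedByte(NamedTuple):
    memory: int        # eight memory bits
    length: int        # three bits that encode instruction length
    d_bit: bool        # instruction length is D-bit dependent
    has_modrm: bool    # instruction code includes a modrm field


def pack_extended(b: ExtendedByte) -> int:
    """Pack into 13 bits; the bit order is an assumption of this sketch."""
    if not (0 <= b.memory <= 0xFF and 0 <= b.length <= 0b111):
        raise ValueError("field out of range")
    return (int(b.has_modrm) << 12) | (int(b.d_bit) << 11) | (b.length << 8) | b.memory


def unpack_extended(word: int) -> ExtendedByte:
    """Recover the memory byte and the five predecode bits."""
    return ExtendedByte(
        memory=word & 0xFF,
        length=(word >> 8) & 0b111,
        d_bit=bool((word >> 11) & 1),
        has_modrm=bool((word >> 12) & 1),
    )
```

The expansion from five to six predecode bits performed by the instruction buffer 408 is not modelled here because the description does not state what the sixth bit carries.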

The instruction buffer 408 receives instruction bytes from the instruction cache 214 in the
memory-aligned word basis of instruction cache 214 storage so that instructions are loaded and replaced with word
granularity. Thus, the instruction buffer 408 byte location 0 always holds bytes that are addressed in memory at
an address of 0 (mod 16).

Instruction bytes are transferred from the instruction buffer 408 to the macroinstruction decoder 230
with byte granularity. During each decode cycle, the sixteen extended instruction bytes within the instruction
buffer 408, including associated implicit word valid bits, are transferred to the plurality of decoders within the
macroinstruction decoder 230. This method of transferring instruction bytes from the instruction cache 214 to
the macroinstruction decoder 230 via the instruction buffer 408 is repeated with each decode cycle as long as
instructions are sequentially decoded. When a control transfer occurs, for example due to a taken branch
operation, the instruction buffer 408 is flushed and the method is restarted.

The current decode PC has an arbitrary byte alignment in that the instruction buffer 408 has a capacity
of sixteen bytes but is managed on a four-byte word basis in which all four bytes of a word are consumed
before removal and replacement of the word with four new bytes in the instruction buffer 408. An instruction
has a length of one to eleven bytes and multiple bytes are decoded so that the alignment of an instruction in the
instruction buffer 408 is arbitrary. As instruction bytes are transferred from the instruction buffer 408 to the
macroinstruction decoder 230, the instruction buffer 408 is reloaded from the instruction cache 214.
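
The combination of byte-granular consumption and word-granular replacement can be sketched in Python as follows. The class `InstructionBuffer`, its `load_word` and `consume` methods and the per-byte `consumed` flags are hypothetical bookkeeping introduced for this example. Only the sixteen-byte capacity, the four-byte word size and the address (mod 16) placement come from the description.

```python
BUFFER_BYTES = 16
WORD_BYTES = 4


class InstructionBuffer:
    def __init__(self):
        self.data = bytearray(BUFFER_BYTES)
        self.word_valid = [False] * (BUFFER_BYTES // WORD_BYTES)
        self.consumed = [False] * BUFFER_BYTES

    def load_word(self, address, word):
        """Place an aligned four-byte word at its memory-aligned slot."""
        if address % WORD_BYTES or len(word) != WORD_BYTES:
            raise ValueError("words are loaded with aligned word granularity")
        slot = address % BUFFER_BYTES          # location 0 holds address 0 (mod 16)
        self.data[slot:slot + WORD_BYTES] = word
        self.consumed[slot:slot + WORD_BYTES] = [False] * WORD_BYTES
        self.word_valid[slot // WORD_BYTES] = True

    def consume(self, decode_pc, length):
        """Mark length bytes starting at the decode PC as decoded.

        Returns the indices of words whose four bytes are now all consumed
        and which may therefore be replaced from the instruction cache.
        """
        for offset in range(length):
            self.consumed[(decode_pc + offset) % BUFFER_BYTES] = True
        freed = []
        for w in range(BUFFER_BYTES // WORD_BYTES):
            whole = self.consumed[w * WORD_BYTES:(w + 1) * WORD_BYTES]
            if self.word_valid[w] and all(whole):
                self.word_valid[w] = False
                freed.append(w)
        return freed
```

For example, after a three-byte instruction at address 0 is consumed no word is freed, and a following three-byte instruction at address 3 frees word 0 while leaving word 1 half consumed.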

Instruction bytes are stored in the instruction buffer 408 with memory alignment rather than a
sequential byte alignment that is suitable for application of consecutive instruction bytes to the macroinstruction
decoder 230. Therefore, a set of byte rotators 430, 432 and 434 are interposed between the instruction buffer
408 and each of the decoders of the macroinstruction decoder 230. Four instruction decoders, including three
short decoders SDec0 410, SDec1 412 and SDec2 414, and one combined long and vectoring decoder 418, share
the byte rotators 430, 432 and 434. In particular, the short decoder SDec0 410 and the combined long and
vectoring decoder 418 share byte rotator 430. Short decoder SDec1 412 is associated with byte rotator 432 and
short decoder SDec2 414 is associated with byte rotator 434.
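
The role of a byte rotator can be shown with a minimal Python sketch that rotates a memory-aligned sixteen-byte buffer so that the byte addressed by a decode PC is presented first. The function names `rotate` and `rotate_for_decoders` are hypothetical, and the sketch assumes that each rotator is simply driven by the low-order bits of its own decode PC.

```python
def rotate(buffer: bytes, decode_pc: int) -> bytes:
    """Rotate a memory-aligned buffer so the byte at decode_pc comes first."""
    start = decode_pc % len(buffer)
    return buffer[start:] + buffer[:start]


def rotate_for_decoders(buffer: bytes, decode_pcs):
    """Apply one rotation per decoder, each with its own decode PC."""
    return [rotate(buffer, pc) for pc in decode_pcs]
```

With a buffer holding bytes 0 through 15 and decode PCs of 5, 7 and 10, the three rotated views begin at bytes 5, 7 and 10 respectively, which is the instruction-aligned form that the three short decoders expect.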

A plurality of pipeline registers, specifically instruction registers 450, 452 and 454, are interposed
between the byte rotators 430, 432 and 434 and the instruction decoder 220 to temporarily hold the instruction
bytes, predecode bits and other information, thereby shortening the decode timing cycle. The other information
held in the instruction registers 450, 452 and 454 includes various information for assisting instruction decoding,
including prefix (e.g. 0F) status, immediate size (8-bit or 32-bit), displacement and long decodable length
designations.

Although a circuit is shown utilizing three rotators and three short decoders, in other embodiments,
different numbers of circuit elements may be employed. For example, one circuit includes two rotators and
two short decoders.






Instructions are stored in memory alignment, not instruction alignment, in the instruction cache 214,
the branch target buffer (BTB) 456 and the instruction buffer 408 so that the location of the first instruction
byte is not known. The byte rotators 430, 432 and 434 find the first byte of an instruction.

The macroinstruction decoder 230 also performs various instruction decode and exception decode
operations, including validation of decode operations and selection between different types of decode operations.
Functions performed during decode operations include prefix byte handling, support for vectoring to the
emulation code ROM 232 for emulation of instructions, and for branch unit 234 operations, branch unit
interfacing and return address prediction. Based on the instruction bytes and associated information, the
macroinstruction decoder 230 generates operation information in groups of four operations corresponding to Op
quads. The macroinstruction decoder 230 also generates instruction vectoring control information and emulation
code control information. The macroinstruction decoder 230 also has output connections to the scheduler 260
and to the emulation ROM 232 for outputting the Op quad information, instruction vectoring control information
and emulation code control information. The macroinstruction decoder 230 does not decode instructions when
the scheduler 260 is unable to accept Op quads or is accepting Op quads from emulation code ROM 232.

The macroinstruction decoder 230 has five distinct and separate decoders, including three "short"
decoders SDec0 410, SDec1 412 and SDec2 414 that function in combination to decode up to three "short"
decode operations of instructions that are defined within a subset of simple instructions of the x86 instruction
set. Generally, a simple instruction is an instruction that translates to fewer than three operations. The short
decoders SDec0 410, SDec1 412 and SDec2 414 each typically generate one or two operations, although zero
operations are generated in certain cases such as prefix decodes. Accordingly for three short decode operations,
from two to six operations are generated in one decode cycle. The two to six operations from the three short
decoders are subsequently packed together by operation packing logic 438 into an Op quad since a maximum of
four of the six operations are valid. Specifically, the three short decoders SDec0 410, SDec1 412 and SDec2
414 each attempt to decode two operations, potentially generating six operations. Only four operations may be
produced at one time so that if more than four operations are produced, the operations from the short decoder
SDec2 414 are invalidated. The five decoders also include a single "long" decoder 416 and a single "vectoring"
decoder 418. The long decoder 416 decodes instructions or forms of instructions having a more complex
address mode form so that more than two operations are generated and short decode handling is not available.
The vectoring decoder 418 handles instructions that cannot be handled by operation of the short decoders SDec0
410, SDec1 412 and SDec2 414 or by the long decoder 416. The vectoring decoder 418 does not actually
decode an instruction, but rather vectors to a location of emulation ROM 232 for emulation of the instruction.
Various exception conditions that are detected by the macroinstruction decoder 230 are also handled as a special
form of vectoring decode operation. When activated, the long decoder 416 and the vectoring decoder 418 each
generates a full Op quad. An Op quad generated by short decoders SDec0 410, SDec1 412 and SDec2 414 has
the same format as an Op quad generated by the long and vectoring decoders 416 and 418. The short decoder
and long decoder Op quads do not include an OpSeq field. The macroinstruction decoder 230 selects either the
Op quad generated by the short decoders 410, 412 and 414 or the Op quad generated by the long decoder 416
or vectoring decoder 418 as an Op quad result of the macroinstruction decoder 230 after each decode cycle.
Short decoder operation, long decoder operation and vectoring decoder operation function in parallel and
independently of one another, although the results of only one decoder are used at one time.
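
The packing rule for the short decoders can be expressed as a brief Python sketch. The function name `pack_op_quad`, the string representation of operations and the `NOOP` placeholder are assumptions for illustration. The rule that SDec2 results are invalidated when more than four operations are produced follows the description, and the padding with NOOP operations follows the statement elsewhere in this description that invalid operations appear as NOOP operations to pad an Op quad.

```python
NOOP = "NOOP"
QUAD_SIZE = 4


def pack_op_quad(sdec0, sdec1, sdec2):
    """Pack the valid operations of three short decoders into one Op quad.

    Each argument is the list of valid operations (zero, one or two) that a
    short decoder produced in this decode cycle.
    """
    for ops in (sdec0, sdec1, sdec2):
        if len(ops) > 2:
            raise ValueError("a short decoder generates at most two operations")
    if len(sdec0) + len(sdec1) + len(sdec2) > QUAD_SIZE:
        sdec2 = []   # more than four operations: SDec2 results are invalidated
    quad = list(sdec0) + list(sdec1) + list(sdec2)
    return quad + [NOOP] * (QUAD_SIZE - len(quad))
```

For instance, when the three decoders produce two, one and two operations, the total of five exceeds the quad size, so the two SDec2 operations are dropped and the quad holds three operations and one NOOP.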

Each of the short decoders 410, 412 and 414 decodes up to seven instruction bytes, assuming the first
byte to be an operation code (opcode) byte and the instruction to be a short decode instruction. Two operations
(Ops) are generated with corresponding valid bits. Appropriate values for effective address size, effective data
size, the current x86-standard B-bit, and any override operand segment register are supplied for the generation
of operations dependent on these parameters. The logical address of the next "sequential" instruction to be
decoded is supplied for use in generating the operations for a CALL instruction. Note that the word sequential
is placed in quotation marks to indicate that, although the sequential address generally points to an instruction
which immediately precedes the present instruction, the "sequential" address may be set to any addressed
location. The current branch prediction is supplied for use in generating the operations for conditional transfer
control instructions. A short decode generates control signals including indications of a transfer control
instruction (for example, Jcc, LOOP, JMP, CALL), an unconditional transfer control instruction (for example,
JMP, CALL), a CALL instruction, a prefix byte, a cc-dependent RegOp, and a designation of whether the
instruction length is address or data size dependent. Typically one or both operations are valid, but prefix byte
and JMP decodes do not generate a valid op. Invalid operations appear as valid NOOP operations to pad an Op
quad.

The first short decoder 410 generates operations based on more than decoding of the instruction bytes.
The first short decoder 410 also determines the presence of any prefix bytes decoded during preceding decode
cycles. Various prefix bytes include 0F, address size override, operand size override, six segment override
bytes, REP/REPE, REPNE and LOCK bytes. Each prefix byte affects a subsequent instruction decode in a
defined way. A count of prefix bytes and a count of consecutive prefix bytes are accumulated during decoding
and furnished to the first short decoder SDec0 410 and the long decoder 416. The consecutive prefix byte count
is used to check whether an instruction being decoded is too long. Prefix byte count information is also used to
control subsequent decode cycles, including checking for certain types of instruction-specific exception
conditions. Prefix counts are reset or initialized at the end of each successful non-prefix decode cycle in
preparation for decoding the prefix and opcode bytes of a next instruction. Prefix counts are also reinitialized
when the macroinstruction decoder 230 decodes branch condition and write instruction pointer (WRIP)
operations.

Prefix bytes are processed by the first short decoder 410 in the manner of one-byte short decode
instructions. At most, one prefix byte is decoded in a decode cycle, a condition that is enforced through
invalidation of all short decodes following the decode of a prefix byte. Effective address size, data size,
operand segment register values, and the current B-bit, are supplied to the first short decoder 410 but can
decode along with preceding opcodes.

The address size prefix affects a decode of a subsequent instruction both for decoding of instructions
for which the generated operation depends on effective address size and for decoding of the address mode and
instruction length of modr/m instructions. The default address size is specified by a currently-specified D-bit,
which is effectively toggled by the occurrence of one or more address size prefixes.

The operand size prefix also affects the decode of a subsequent instruction both for decoding of
instructions for which the generated operation depends on effective data size and for decoding of the instruction
length. The default operand size is specified by a currently-specified x86-standard D-bit, which is effectively
toggled by the occurrence of one or more operand size prefixes.

The segment override prefixes affect the decode of a subsequent instruction only in a case when the
generation of a load-store operation (LdStOps) is dependent on the effective operand segment of the instruction.
The default segment is DS or SS, depending on the associated general address mode, and is replaced by the
segment specified by the last segment override prefix.
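
The prefix rules in the three preceding paragraphs can be sketched as follows. This is a minimal illustration only, assuming that a set D-bit selects 32-bit defaults (the usual x86 convention); the function name, the prefix labels such as "ADDR_SIZE" and "SEG_ES", and the `stack_based` flag are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecodeEnv:
    address_size: int  # effective address size in bits (16 or 32)
    operand_size: int  # effective data size in bits (16 or 32)
    segment: str       # effective operand segment register


def effective_env(d_bit: int, prefixes: List[str], stack_based: bool) -> DecodeEnv:
    """Derive effective sizes and segment from the D-bit and preceding prefixes.

    prefixes: hypothetical labels in decode order, e.g. "ADDR_SIZE",
    "OPER_SIZE", "SEG_ES". stack_based stands in for an address mode
    whose default segment is SS rather than DS.
    """
    default = 32 if d_bit else 16
    toggled = 16 if d_bit else 32
    # One or more size prefixes toggle the default exactly once.
    address_size = toggled if "ADDR_SIZE" in prefixes else default
    operand_size = toggled if "OPER_SIZE" in prefixes else default
    # The last segment override prefix wins over the default segment.
    segment = "SS" if stack_based else "DS"
    for prefix in prefixes:
        if prefix.startswith("SEG_"):
            segment = prefix[len("SEG_"):]
    return DecodeEnv(address_size, operand_size, segment)
```

For example, with a 32-bit default, two address size prefixes and two segment overrides, the sketch yields a 16-bit address size, a 32-bit operand size, and the segment named by the last override.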

The REP/REPE and REPNE prefixes do not affect the decode of a subsequent instruction. If the
instruction is decoded by the macroinstruction decoder 230, rather than the emulation code ROM 232, then any
preceding REP prefixes are ignored. However, if the instruction is vectored, then the generation of the vector
address is modified in some cases. Specifically, if a string instruction or particular neighboring opcode is
vectored, then an indication of the occurrence of one or more of the REP prefixes and designation of the last
REP prefix encountered are included in the vector address. For all other instructions the vector address is not
modified and the REP prefix is ignored.

A LOCK prefix inhibits all short and long decoding except the decoding of prefix bytes, forcing the
subsequent instruction to be vectored. When the vector decode cycle of this subsequent instruction occurs, so
long as the subsequent instruction is not a prefix, the opcode byte is checked to ensure that the instruction is
within a "lockable" subset of the instructions. If the instruction is not a lockable instruction, an exception
condition is recognized and the vector address generated by the vectoring decoder 418 is replaced by an
exception entry point address.

Instructions decoded by the second and third short decoders 412 and 414 do not have prefix bytes so
that decoders 412 and 414 assume fixed default values for address size, data size, and operand segment register
values.

Typically, the three short decoders generate four or fewer operations because three consecutive short
decodes are not always performed and instructions often short decode into only a single operation. However,
for the rare occurrence when more than four valid operations are generated, operation packing logic 438 inhibits
or invalidates the third short decoder 414 so that only two instructions are successfully decoded and at most four
operations are generated for packing into an Op quad.

When the first short decoder 410 is unsuccessful, the actions of the second and third short decoders 412
and 414 are invalidated. When the second short decoder 412 is unsuccessful, the action of the third short
decoder 414 is invalidated. When even the first short decode is invalid, the decode cycle becomes a long or
vectoring decode cycle. In general, the macroinstruction decoder 230 attempts one or more short decodes and,
if such short decodes are unsuccessful, attempts one long decode. If the long decode is unsuccessful, the
macroinstruction decoder 230 performs a vectoring decode. Multiple conditions cause the short decoders 410,
412 and 414 to be invalidated. Most generally, short decodes are invalidated when the instruction operation
code (opcode) or the designated address mode of a modr/m instruction does not fall within a defined short
decode or "simple" subset of instructions. This condition typically restricts short decode instructions to those
operations that generate two or fewer operations. Short decodes are also invalidated when not all of the bytes in
the instruction buffer 408 for a decoded instruction are valid. Also, "cc-dependent" operations, operations that
are dependent on status flags, are only generated by the first short decoder 410 to ensure that these operations
are not preceded by any ".cc" RegOps. A short decode is invalidated for a second of two consecutive short
decodes when the immediately preceding short decode was a decode of a transfer control instruction, regardless
of the direction taken. A short decode is invalidated for a second of two consecutive short decodes when the
first short decode was a decode of a prefix byte. In general, a prefix code or a transfer control code inhibits
further decodes in a cycle.
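
The fallback order described above (short decodes first, then one long decode, then a vectoring decode) can be sketched as below. This is an illustrative model under simplifying assumptions: each decoder's success is reduced to a boolean, and the function name and return format are hypothetical rather than taken from the specification.

```python
from typing import List, Tuple


def decode_cycle(short_ok: List[bool], long_ok: bool) -> Tuple[str, int]:
    """Pick the decode type for one cycle.

    short_ok: success flags for SDec0, SDec1 and SDec2, in order.
    long_ok: whether the long decoder could decode the first instruction.
    Returns (decode type, number of instructions decoded).
    """
    # A failed short decoder invalidates every short decoder after it,
    # so only the leading run of successes counts.
    count = 0
    for ok in short_ok:
        if not ok:
            break
        count += 1
    if count > 0:
        return ("short", count)
    if long_ok:
        return ("long", 1)
    # Vectoring is the default when neither short nor long decode applies.
    return ("vectoring", 1)
```

The sketch ignores the byte-validity and exception conditions, which can further inhibit or force a decode type.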

Furthermore, no more than sixteen instruction bytes are consumed by the macroinstruction decoder 230
since the instruction buffer 408 only holds sixteen bytes at one time. Also, at most four operations can be
packed into an Op quad. These constraints only affect the third short decoder 414 since the length of each short
decoded instruction is at most seven bytes and operations in excess of four only arise in the third short decoder
414.

In a related constraint, if the current D-bit value specifies a 16-bit address and data size default, then
an instruction having a length that is address and/or data dependent can only be handled by the first short
decoder 410 since the predecode information is probably incorrect. Also, when multiple instruction decoding is
disabled, only the first short decoder 410 is allowed to successfully decode instructions and prefix bytes.

Validation tests are controlled by short decoder validation logic (not shown) in the macroinstruction
decoder 230 and are independent of the operation of short decoders 410, 412 and 414. However, each of the
short decoders 410, 412 and 414 does set zero, one or two valid bits depending on the number of operations
decoded. These valid bits, a total of six for the three short decoders 410, 412 and 414, are used by the
operation packing logic 438 to determine which operations to pack into an Op quad and to force invalid
operations to appear as NOOP (no operation) operations. The operation packing logic 438 operates without
short decoder validation information since valid short decodes and associated operations are preceded only by
other valid short decodes and associated operations.
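
The packing behavior described above can be sketched as follows. This is a simplified model, not the actual logic 438: operations are represented as hypothetical (mnemonic, valid bit) pairs, two per short decoder, and the function name is an assumption made for illustration.

```python
from typing import List, Tuple

Op = Tuple[str, bool]  # (mnemonic, valid bit)


def pack_op_quad(decoder_ops: List[List[Op]]) -> List[str]:
    """Pack valid operations from up to three short decoders into an Op quad.

    decoder_ops holds one list per short decoder, in decoder order, each
    with two (mnemonic, valid) slots. A decoder whose valid operations
    would push the total past four is dropped entirely; given two slots
    per decoder, only the third decoder can cause such an overflow.
    """
    packed: List[str] = []
    for ops in decoder_ops:
        valid = [name for name, is_valid in ops if is_valid]
        if len(packed) + len(valid) > 4:
            break  # inhibit this decoder and any after it
        packed.extend(valid)
    # Remaining slots appear as NOOP operations to pad the quad.
    return packed + ["NOOP"] * (4 - len(packed))
```

When the three decoders together yield five valid operations, the sketch keeps only the operations of the first two decoders and pads the quad with a NOOP.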

The short decoders 410, 412 and 414 also generate a plurality of signals representing various special
opcode or modr/m address mode decodes. These signals indicate whether a certain form of instruction is
currently being decoded by the instruction decoder 220. These signals are used by short decode validation logic
to handle short decode validation situations.

The instruction bytes, which are stored unaligned in the instruction buffer 408, are aligned by byte
rotators 430, 432 and 434 as the instruction bytes are transferred to the decoders 410-418. The first short
decoder SDec0 410, the long decoder 416 and the vectoring decoder 418 share a first byte rotator 430. The
second and third short decoders SDec1 412 and SDec2 414 use respective second and third byte rotators 432
and 434. During each decode cycle, the three short decoders SDec0 410, SDec1 412 and SDec2 414 attempt to
decode what are, most efficiently, three short decode operations using three independently-operating and parallel
byte rotators 430, 432 and 434. Although the multiplexing by the byte rotators 430, 432 and 434 of appropriate
bytes in the instruction buffer 408 to each respective decoder SDec0 410, SDec1 412 and SDec2 414 is
conceptually dependent on the preceding instruction decode operation, instruction length lookahead logic 436
uses predecode bits to enable the decoders to operate substantially in parallel.

The long and vectoring decoders 416 and 418, in combination, perform two parallel decodes of eleven
instruction bytes, taking the first byte to be an opcode byte and generating either a long instruction decode Op
quad or a vectoring decode Op quad. Information analyzed by the long and vectoring decoders 416 and 418
includes effective address size, effective data size, the current B-bit and DF-bit, any override operand segment
register, and logical addresses of the next "sequential" and target instructions to be decoded. The long and
vectoring decoders 416 and 418 generate decode signals including an instruction length excluding preceding
prefix bits, a designation of whether the instruction is within the long decode subset of instructions, a RET
instruction, and an effective operand segment register, based on a default implied by the modr/m address mode
plus any segment override.

During a decode cycle in which none of the short decoders SDec0 410, SDec1 412 and SDec2 414
successfully decodes a short instruction, the macroinstruction decoder 230 attempts to perform a long decode
using the long decoder 416. If a long decode cannot be performed, a vectoring decode is performed. In some
embodiments, the long and vectoring decoders 416 and 418 are conceptually separate and independent decoders,
just as the long and vectoring decoders 416 and 418 are separate and independent of the short decoders 410,
412 and 414. Physically, however, the long and vectoring decoders 416 and 418 share much logic and generate
similar Op quad outputs. Instructions decoded by the long decoder 416 are generally included within the short
decode subset of instructions except for an address mode constraint such as that the instruction cannot be
decoded by a short decoder because the instruction length is greater than seven bytes or because the address has
a large displacement that would require generation of a third operation to handle the displacement. The long
decoder 416 also decodes certain additional modr/m instructions that are not in the short decode subset but are
sufficiently common to warrant hardware decoding. Instruction bytes for usage or decoding by the long decoder
416 are supplied from the instruction buffer 408 by the first byte rotator 430, the same instruction multiplexer
that supplies instruction bytes to the first short decoder SDec0 410. However, while the first short decoder
SDec0 410 receives only seven bytes, the long decoder 416 receives up to eleven consecutive instruction bytes,
corresponding to the maximum length of a modr/m instruction excluding prefix bytes. Thus, the first byte
rotator 430 is eleven bytes wide although only the first seven bytes are connected to the first short decoder
SDec0 410. The long decoder 416 only decodes one instruction at a time so that predecode
information within the instruction buffer 408 is not used and is typically invalid.

The first byte of the first byte rotator 430 is fully decoded as an opcode byte and, in the case of a
modr/m instruction, the second instruction byte and possibly the third are fully decoded as modr/m and sib
bytes, respectively. The existence of a 0F prefix is considered in decoding of the opcode byte. The 0F prefix
byte inhibits all short decoding since all short decode instructions are non-0F or "one-byte" opcodes. Because
all prefix bytes are located within the "one-byte" opcode space, decoding of a 0F prefix forces the next decode
cycle to be a two-byte opcode instruction, such as a long or vectoring decode instruction. In addition to
generating operations based on the decoding of modr/m and sib bytes, the first byte rotator 430 also determines
the length of the instruction for usage by various program counters, whether the instruction is a modr/m
instruction for inhibiting or invalidating the long decoder, and whether the instruction is an instruction within
the long decode subset of operation codes (opcodes). The long decoder 416 always generates four operations
and, like the short decoders 410, 412, and 414, presents the operations in the form of an emulation code-like
Op quad, excluding an OpSeq field. The long decoder 416 handles only relatively simple modr/m instructions.

A long decode Op quad has two possible forms that differ only in whether the third operation is a load
operation (LdOp) or a store operation (StOp) and whether the fourth operation is a RegOp or a NOOP. A first
long decode Op quad has the form:



sTj>/d o<<s»m»«aua,as.i 

NOOP 

The @(<gam>) address mode specification represents an address calculation corresponding to that
specified by the modr/m and/or sib bytes of the instruction, for example @(AX + BX*4 + LD). The
<imm32> and <disp32> values are four-byte values containing the immediate and displacement instruction
bytes when the decoded instruction contains such values.

The long decoder 416, like the first short decoder 410, generates operations taking into account the
presence of any prefix bytes decoded by the short decoders during preceding decode cycles. Effective address
size, data size, operand segment register values, and the current B-bit are supplied to the long decoder 416 and
are used to generate operations. No indirect size or segment register specifiers are included within the final
operations generated by the long decoder 416.

Only a few conditions inhibit or invalidate an otherwise successful long decode. One such condition is
an instruction operation code (opcode) that is not included in the long decode subset of instructions. A second
condition is that not all of the instruction buffer 408 bytes for the decoded instruction are valid.

The vectoring decoder 418 handles instructions that are not decoded by either the short decoders or the
long decoder 416. Vectoring decodes are a default case when no short or long decoding is possible and
sufficient valid bytes are available. Typically, the instructions handled by the vectoring decoder 418 are not
included in the short decode or long decode subsets but also result from other conditions such as decoding being
disabled or the detection of an exception condition. During normal operation, only non-short and non-long
instructions are vectored. However, all instructions may be vectored. Undefined opcodes are always vectored.
Only prefix bytes are always decoded. Prefix bytes are always decoded by the short decoders 410, 412 and
414.

When an exception condition is detected during a decode cycle, a vectoring decode is forced, generally
overriding any other form of decode without regard for instruction byte validity of the decoded instruction.
When a detected exception condition forces a vectoring decode cycle, the generated Op quad is undefined and
the Op quad valid bit for presentation to the scheduler 260 is forced to zero. The Op quad valid bit informs the
scheduler 260 that no operations are to be loaded to the scheduler 260. As a result, no Op quad is loaded into
the scheduler 260 during an exception vectoring decode cycle.

Few conditions inhibit or invalidate a vectoring decode. One such condition is that not all of the bytes
in the instruction buffer 408 are valid.

When an instruction is vectored, control is transferred to an emulation code entry point. An emulation
code entry point is either in internal emulation code ROM 232 or in external emulation code RAM 236. The
emulation code starting from the entry point address either emulates an instruction or initiates appropriate
exception processing.

A vectoring decode cycle is properly considered a macroinstruction decoder 230 decode cycle. In the
case of a vectoring decode, the macroinstruction decoder 230 generates the vectoring quad and generates the
emulation code address into the emulation code ROM 232. Following the initial vectoring decode cycle, the
macroinstruction decoder 230 remains inactive while instructions are generated by the emulation code ROM 232
or emulation code RAM 236 until a return from emulation (ERET) OpSeq is encountered. The return from
emulation (ERET) sequencing action transitions back to macroinstruction decoder 230 decoding. During the
decode cycles following the initial vectoring decode cycle, the macroinstruction decoder 230 remains inactive,
continually attempting to decode the next "sequential" instruction but having decode cycles repeatedly
invalidated until after the ERET is encountered, thus waiting by default to decode the next "sequential"
instruction.

Instruction bytes for usage or decoding by the vectoring decoder 418 are supplied from the instruction
buffer 408 by the first byte rotator 430, the same instruction multiplexer that supplies instruction bytes to the
first short decoder SDec0 410 and to the long decoder 416. The vectoring decoder 418 receives up to eleven
consecutive instruction bytes, corresponding to the maximum length of a modr/m instruction excluding prefix
bytes. Thus, the full eleven byte width of the first byte rotator 430 is distributed to both the long decoder 416
and the vectoring decoder 418. The predecode information within the instruction buffer 408 is not used by the
vectoring decoder 418.

As in the case of the long decoder 416, the first byte of the first byte rotator 430 is fully decoded as an
opcode byte and, in the case of a modr/m instruction, the second instruction byte and possibly the third are fully
decoded as modr/m and sib bytes, respectively. The vectoring decoder 418 generates operations taking into
account the presence of any prefix bytes decoded by the short decoders during preceding decode cycles. The
existence of a 0F prefix is considered in decoding of the opcode byte. In addition to generating operations
based on the decoding of modr/m and sib bytes, the first byte rotator 430 also determines the length of the
instruction for usage by various program counters, whether the instruction is a modr/m instruction for inhibiting
or invalidating the long decoder, and whether the instruction is an instruction within the long decode subset of
operation codes (opcodes). If not, a vectoring decode is initiated. Effective address size, data size and operand
segment register values are supplied to the vectoring decoder 418 and are used to generate operations. No
indirect size or segment register specifiers are included within the final operations generated by the vectoring
decoder 418.

During a vectoring decode cycle, the vectoring decoder 418 generates a vectoring Op quad, generates
an emulation code entry point or vector address, and initializes an emulation environment. The vectoring Op
quad is specified to pass various information to initialize emulation environment scratch registers.

The value of the emulation code entry point or vector address is based on a decode of the first and
second instruction bytes, for example the opcode and modr/m bytes, plus other information such as the presence
of a 0F prefix, a REP prefix and the like. In the case of vectoring caused by an exception condition, the entry
point or vector address is based on a simple encoded exception identifier.

The emulation environment is stored for resolving environment dependencies. All of the short
decoders 410, 412 and 414 and long decoder 416 directly resolve environmental dependencies, such as
dependencies upon effective address and data sizes, as operations are generated so that these operations never
contain indirect size or register specifiers. However, emulation code operations do refer to such effective
address and data size values for a particular instance of the instruction being emulated. The emulation
environment is used to store this additional information relating to the particular instruction that is vectored.
This information includes general register numbers, effective address and data sizes, an effective operand
segment register number, the prefix byte count, and a record of the existence of a LOCK prefix. The emulation
environment also loads a modr/m reg field and a modr/m regm field into Reg and Regm registers.
The emulation environment is initialized at the end of a successful vectoring decode cycle and remains at the
initial state for substantially the duration of the emulation of an instruction by emulation code, until an ERET
code is encountered.

The vectoring decoder 418 generates four operations of an Op quad in one of four forms. All four
forms include three LIMM operations. The four forms differ only in the immediate values of the LIMM
operations and in whether the third operation is an LEA operation or a NOOP operation.

A first vectoring decode Op quad has the form:

    LIMM t2,<imm32>
    LIMM t1,<disp32>
    LEA  t6,@(<gam>)
    LIMM t7,LogSeqDecPC[31..0]    //logical seq. next
                                  //instr. PC

A second vectoring decode Op quad has the form:

    LIMM t2,<imm32>
    LIMM t1,<disp32>
    NOOP
    LIMM t7,LogSeqDecPC[31..0]

A third vectoring decode Op quad has the form:

    LIMM t2,<+/-1/2/4>            //equiv to "LDK(D)S t2,+1/+2"
    LIMM t1,<+/-2/4/8>            //equiv to "LDK(D)S t1,+2/+4"
    NOOP
    LIMM t7,LogSeqDecPC[31..0]

A fourth vectoring decode Op quad has the form:

    LIMM t2,<+2/4>                //equiv to "LDKD t2,+2"
    LIMM t1,<disp32>
    LIMM t7,LogSeqDecPC[31..0]    //predicted RET target adr
                                  //from Return Address Stack

The [...] forms of vectoring Op quads [...]. The first form is used for [...] emulation requirements
for near RET instructions.



The macroinstruction decoder 230 has three decode program counters 420, 422 and 424, and one fetch
program counter 426. A first decode program counter, called an instruction PC 420, is the logical address
of [...] instruction is currently [...]. If the decode operation [...] be decoded. The instruction PC 420
corresponds [...] the scheduler 260 with [...] Op quad is generated by the macroinstruction decoder 230 [...].
A second decode program counter, called a logical decode PC 422, is the logical address of the next
instruction byte to be decoded and addresses either an opcode byte or a prefix byte. A third decode program
counter, called a linear decode PC 424, is the linear address of the next instruction byte to be decoded and
addresses either an opcode byte or a prefix byte. The logical decode PC 422 and the linear decode PC 424 [...].
The linear decode PC 424 designates the address of the instruction byte currently at the first byte rotator 430.

The various decoders in the macroinstruction decoder 230 [...] consuming either prefix bytes or whole
instructions [...] handled as one-byte instructions. Therefore, [...] are more important than instruction [...].
Consequently, at the beginning of each decode cycle, the next instruction byte to be decoded is not
necessarily [...].

At the beginning of a decode cycle the logical decode PC 422 and the linear decode PC 424 [...] the
logical and linear address of [...]. The linear decode PC 424 is a primary program counter value that is used
during the decoding process to access the instruction buffer 408. The linear decode PC 424 represents the
starting point for the decode of a cycle and specifically controls the byte rotator feeding bytes from the
instruction buffer 408 to the first short decoder 410 and to the long and vectoring decoders 416 and 418. The
linear decode PC 424 also is the reference point for determining the instruction [...] control signals for the
byte rotators [...].



The linear decode PC 424 also acts secondarily to check for breakpoint matches during the first decode
cycles of new instructions, before prefix bytes are decoded, and to check for code segment overruns by the
macroinstruction decoder 230 during successful instruction decode cycles.

The logical decode PC 422 is used for program counter-related transfer control instructions, including
CALL instructions. The logical decode PC 422 is supplied to the branch unit 234 to be summed with the
displacement value of a PC-relative transfer control instruction to calculate a branch target address. The
logical decode PC 422 also supports emulation code emulation of instructions. The next sequential logical
decode program counter (PC) 422 is available in emulation code from storage in a temporary register by the
vectoring Op quad for general usage. For example, the next sequential logical decode PC 422 is used to supply
a return address that a CALL instruction pushes on a stack.

A next logical decode PC 428 is set to the next sequential logical decode program counter value and 
has functional utility beyond that of the logical decode PC 422. The next logical decode PC 428 directly 
furnishes the return address for CALL instructions decoded by the macroinstruction decoder 230. The next 
logical decode PC 428 also is passed to emulation code logic during vectoring decode cycles via one of the 
operations within the vectoring Op quad. 

During a decode cycle, the linear decode PC 424 points to the next instruction bytes to be decoded.
The four least significant bits of linear decode PC 424 point to the first instruction byte within the instruction
buffer 408 and thereby directly indicate the amount of byte rotation necessary to align the first and subsequent
instruction bytes in the instruction cache 214. The first byte rotator 430 is an instruction multiplexer,
specifically a 16:1 byte multiplexer, for accessing bytes in the instruction buffer 408 that are offset by the linear
decode PC 424 amount. The first byte rotator 430 is seven bytes wide for the first short decoder SDec0 410
and eleven bytes wide for the long decoder 416 and the vectoring decoder 418 in combination. Shared logic in
the first short decoder SDec0 410, the long decoder 416 and the vectoring decoder 418 generates a first
instruction length value ILen0 for the first instruction. The second and third byte rotators 432 and 434 are seven
byte-wide instruction multiplexers, specifically 16:1 byte multiplexers. The second byte rotator 432 accesses
bytes in the instruction buffer 408 that are offset by the sum of the linear decode PC 424 amount and the first
instruction length ILen0 (modulo 16). Logic in the second short decoder SDec1 412 generates a second
instruction length value ILen1 for the second instruction. The third byte rotator 434 accesses bytes in the
instruction buffer 408 that are offset by the sum of the linear decode PC 424 amount and the first and second
instruction lengths ILen0 and ILen1. The byte rotators 430, 432 and 434 multiplex instruction bytes but not
predecode bits. The byte rotators 430, 432 and 434 are controlled using predecode information in which the
predecode bits associated with the first opcode byte or the first byte of the first instruction directly control the
second rotator 432. The first byte of the second instruction directly controls the third rotator 434. Each
predecode code implies an instruction length but what is applied to the next rotator is a pointer. The pointer is
derived by taking the four least significant bits of the program counter at the present instruction plus the length
to attain the program counter to the next instruction.
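
The rotator offset arithmetic described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the instruction buffer is modeled as a 16-byte sequence, and applying the modulo-16 wraparound to the third rotator is an assumption extended from the stated behavior of the second rotator.

```python
from typing import Tuple


def rotator_offsets(linear_decode_pc: int, ilen0: int, ilen1: int) -> Tuple[int, int, int]:
    """Return the buffer offsets selected by the three byte rotators."""
    base = linear_decode_pc & 0xF           # four least significant bits
    second = (base + ilen0) % 16            # start of the second instruction
    third = (base + ilen0 + ilen1) % 16     # start of the third (wrap assumed)
    return (base, second, third)


def rotate(buffer: bytes, offset: int, width: int) -> bytes:
    """Model a 16:1 byte multiplexer: read `width` bytes starting at `offset`,
    wrapping around the 16-byte instruction buffer."""
    return bytes(buffer[(offset + i) % 16] for i in range(width))
```

With a linear decode PC ending in 0xD and instruction lengths of 3 and 5 bytes, the three rotators select offsets 13, 0 and 5, and a seven-byte read at offset 13 wraps around the end of the buffer.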

All program counters 420, 422, 424 and 428 in the macroinstruction decoder 230 are initialized during
instruction and exception processing. A plurality of signal sources activate this initialization. First, the branch
unit 234 supplies a target branch address when a PC-relative transfer control instruction is decoded and
predicted taken. Second, a return address stack (not shown) supplies a predicted return target address when a
near RET instruction is decoded. Third, the scheduler 260 generates a correct and alternate branch address
when the macroinstruction decoder 230, along with the remaining circuits in the processor 120, is restarted by
the scheduler 260 due to a mispredicted branch condition (BRCOND) operation. Fourth, register unit 244, the
primary RegOp execution unit, supplies a new decode address when a WRIP RegOp is executed. The WRIP
RegOp execution allows emulation code to explicitly redirect instruction decoding. In all four cases, a logical
address is supplied and utilized to simultaneously reinitialize the three decode program counters 420, 422 and
424. For the linear decode PC 424, a linear address value is supplied by adding the supplied logical address to
the current code segment base address to produce the corresponding linear address for feeding into linear decode
PC 424. The logical address is loaded into the current instruction PC 420 and the logical decode PC 422. For
each decode cycle until a next reinitialization, the macroinstruction decoder 230 sequentially and synchronously
updates the current instruction PC 420, the logical decode PC 422 and the linear decode PC 424 as instruction
bytes are successfully decoded and consumed by the individual decoders of macroinstruction decoder 230.
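
The reinitialization step described above can be sketched as follows. This is a minimal model with hypothetical names; truncating the linear address to 32 bits is an assumption made for the sketch, and the per-cycle updates of the counters are not modeled.

```python
from dataclasses import dataclass


@dataclass
class DecodePCs:
    instruction_pc: int     # instruction PC 420 (logical address)
    logical_decode_pc: int  # logical decode PC 422
    linear_decode_pc: int   # linear decode PC 424


def reinitialize(logical_address: int, cs_base: int) -> DecodePCs:
    """Load all three decode program counters from one supplied logical address.

    The linear decode PC is the logical address plus the current code
    segment base; the 32-bit mask is an assumption of this sketch.
    """
    linear = (logical_address + cs_base) & 0xFFFFFFFF
    return DecodePCs(
        instruction_pc=logical_address,
        logical_decode_pc=logical_address,
        linear_decode_pc=linear,
    )
```

The same routine would serve all four sources named above (branch unit, return address stack, scheduler restart and WRIP RegOp), since each supplies a logical address.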

Generation of the instruction lengths ILen0 and ILen1 occurs serially. To hasten this serial process by
emulating a parallel operation, instruction length lookahead logic 436 quickly determines the instruction lengths
ILen0 and ILen1 using four predecode bits specifying the length of each instruction byte in the instruction buffer
408. The predecode bits associated with the opcode byte of the first instruction byte in the instruction buffer
408, the first instruction byte being multiplexed to the first short decoder SDec0 410, directly specify a byte
index of the opcode byte of the second instruction byte in the instruction buffer 408. The predecode bits
associated with the opcode byte of the second instruction byte in the instruction buffer 408, the second
instruction byte being multiplexed to the second short decoder SDec1 412, directly specify a byte index of the
opcode byte of the third instruction byte in the instruction buffer 408. The instruction length lookahead logic
436 includes two four-bit-wide 16:1 multiplexers for generating the byte indices of the opcode bytes of the
second and third instruction bytes in the instruction buffer 408.

The instruction lookahead logic 436 also includes logic for determining validity of the sets of predecode
bits. Predecode bits are valid when the associated instruction byte is the start of a valid short decode
instruction. Specifically, the instruction lookahead logic 436 determines whether predecode bits for a given byte
in the instruction buffer 408 point to the same byte, implying a zero length for an instruction starting at that
byte. If so, that byte is not the start of a short decode instruction and no further short decoding is possible.
Otherwise, a short decode operation is possible and predecode bits point to the beginning of the next instruction.
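
The chained lookup performed by the lookahead logic may be sketched as follows, purely as an illustration. The sketch assumes the predecode bits have already been expanded to four bits per byte and are held in a sixteen-element list; the function names and the use of `None` to mark the point where short decoding stops are assumptions of the sketch.

```python
def is_short_decode_start(predecode, index):
    """A byte starts a short decode instruction only if its predecode bits
    point at a different byte, that is, imply a nonzero length."""
    return (predecode[index] & 0xF) != index


def lookahead(predecode, first_index):
    """Return (second_index, third_index, ilen0, ilen1).

    Entries are None from the point where short decoding cannot continue.
    """
    if not is_short_decode_start(predecode, first_index):
        return None, None, None, None
    second = predecode[first_index] & 0xF   # models the first 16:1 multiplexer
    ilen0 = (second - first_index) & 0xF
    if not is_short_decode_start(predecode, second):
        return second, None, ilen0, None
    third = predecode[second] & 0xF         # models the second 16:1 multiplexer
    ilen1 = (third - second) & 0xF
    return second, third, ilen0, ilen1
```

Because each lookup is a simple indexed selection rather than a length computation, the two steps can be cascaded within a cycle, which is the sense in which the serial process emulates a parallel one.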

The predecoder 270 connected between the main memory 130 and the instruction cache 214 has eight
logic units, each of which examines its associated instruction byte plus, in some cases, the following one or two
instruction bytes. The first instruction byte is decoded as an opcode byte and the second and third instruction
bytes, if the opcode byte is a modr/m opcode, are decoded as modr/m and sib bytes. Based on these three
bytes, the length of an instruction and whether the instruction is classified as a "short" instruction are
determined. The length of the instruction is added to a four-bit fixed value corresponding to the position of the
logic unit with respect to the sixteen logic units to determine the byte index used by the instruction length
lookahead logic 436. This byte index is set as the value of the predecode bits if the instruction falls within the
criteria of a short instruction. For instruction bytes not meeting the short instruction criteria, the predecode bits
are set to the four-bit fixed value corresponding to the position of the logic unit with respect to the sixteen logic
units without increment to designate an instruction length of zero. An implied instruction length of zero is
indicative that the instruction is not a short instruction. The predecode bits are truncated from four bits to three
since short decode instructions are never longer than seven bytes and the most significant bit is easily
reconstructed from the three predecode bits and the associated fixed byte address. The expansion from three to
four predecode bits is performed by predecode expansion logic 440 having sixteen logic units corresponding to
the sixteen instruction bytes of the instruction cache 214. The sixteen logic units of predecode expansion logic
440 operate independently and simultaneously on predecode bits as the instruction bytes are fetched from the
instruction cache 214 to the instruction buffer 408.
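
The encoding, truncation and expansion of the predecode bits may be illustrated by the sketch below. The text states only that the most significant bit is reconstructed from the three stored bits and the fixed byte address; the particular reconstruction rule used in `expand_predecode`, as well as all function names, are assumptions of the sketch.

```python
def predecode_value(position, length, is_short):
    """Four-bit predecode value produced by the logic unit at `position`.

    Short instructions store position + length (modulo 16); all other
    bytes store the position itself, implying a length of zero.
    """
    if not is_short:
        return position & 0xF
    return (position + length) & 0xF


def truncate_predecode(value4):
    """Keep only the three low-order bits for storage."""
    return value4 & 0x7


def expand_predecode(value3, position):
    """Rebuild the four-bit byte index from three stored bits.

    Assumed rule: because a short instruction is at most seven bytes long,
    the length equals (stored bits - position) modulo 8, and the full index
    is position + length modulo 16.
    """
    length = (value3 - position) & 0x7
    return (position + length) & 0xF
```

Under this rule the stored three bits lose no information for any position and any length from zero through seven.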

The final two of the thirty-two instruction bytes that are predecoded and loaded to the instruction cache
214 have only one or two bytes for examination by the predecoder 270. For modr/m opcodes the full
instruction length cannot be determined. Thus logic units for bytes 14 and 15 in the predecoder 270 are
modified from logic units for bytes 0 through 13. For instruction byte 15, logic unit 15 of the predecoder 270
forces an instruction length of zero for all modr/m opcodes and for non-short decode instructions. For
instruction byte 14, an effective instruction length of zero is forced for modr/m opcodes with an address mode
requiring examination of a sib byte to reliably determine instruction length, as well as for non-short instructions.

During each decode cycle, the macroinstruction decoder 230 checks for several exception conditions,
including an instruction breakpoint, a pending nonmaskable interrupt (NMI), a pending interrupt (INTR), a code
segment overrun, an instruction fetch page fault, an instruction length greater than sixteen bytes, a nonlockable
instruction with a LOCK prefix, a floating point not available condition, and a pending floating point error
condition. Some conditions are evaluated only during a successful decode cycle; other conditions are evaluated
irrespective of any decoding actions during the cycle. When an active exception condition is detected, all
instruction decode cycles, including short, long and vectoring decode cycles, are inhibited and an "exception"
vectoring decode is forced in the decode cycle following exception detection. The recognition of an exception
condition is only overridden or inhibited by inactivity of the macroinstruction decoder 230, for example, when
emulation code Op quads are accepted by the scheduler 260, rather than short and long or vector decoder Op
quads. In effect, recognition and handling of any exception conditions are delayed until an ERET Op seq
returns control to the macroinstruction decoder 230.

During the decode cycle that forces exception vectoring, a special emulation code vector address is
generated in place of a normal instruction vector address. The vectoring Op quad that is generated by the long
and vectoring decoders 416 and 418 is undefined. The exception vector address is a fixed value except for
low-order bits for identifying the particular exception condition that is recognized and handled. When multiple
exception conditions are detected simultaneously, the exceptions are ordered in a priority order and the highest
priority exception is recognized.
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Tl* instruction b«kpc^^ 
linear decode PC 424 points to the first byte of an instruction including prefixes, the linear decode PC 424 
.raate.brertpoint.dd^ 

mask flags are clear. One mask flag (RF) specifically masks recognition of instruction breakpoints. Another
mask flag (BNTF) temporarily masks NMI requests and instruction breakpoints.

The pending NMI exception, the penultimate priority exception, is recognized when an NMI request is
pending and none of the NMI mask flags are clear. One mask (NF) specifically masks nonmaskable interrupts.
Another mask flag (BNTF) temporarily masks NMI requests and instruction breakpoints.

The pending INTR exception, the next exception in priority following the pending NMI exception, is
recognized when an INTR request is pending and the applicable mask flags are clear.

The code segment overrun exception, the next exception in priority following the pending INTR
exception, is recognized when the macroinstruction decoder 230 attempts to successfully decode
instructions beyond a current code segment limit.

The instruction fetch page fault exception, having a priority immediately lower than the code segment
overrun exception, is recognized when the macroinstruction decoder 230 requires additional valid instruction
bytes from the instruction buffer 408 before decoding of another instruction or prefix byte is possible and the
instruction translation lookaside buffer (ITB) signals that a page fault has occurred on the current instruction
fetch. A faulting condition of the instruction fetch control circuit 218 is repeatedly retried so that the ITB
continually reports a page fault until the page fault is recognized by the macroinstruction decoder 230 and
subsequent exception handling processing stops and redirects instruction fetching to a new address. The fault
indication from the ITB has the same timing as instructions loaded from the instruction cache 214 and,
therefore, is registered in the subsequent decode cycle. The ITB does not necessarily signal a fault on
consecutive instruction fetch attempts, so the fault indication is held until fetching is redirected to a new
instruction address. Upon recognition of a page fault, additional fault information is loaded into a special
register field.

The instruction length greater than sixteen bytes exception, which has a priority just below the
instruction fetch page fault exception, is recognized when the macroinstruction decoder 230 attempts to
successfully decode an instruction whose total length, including prefix bytes, is greater than sixteen bytes. The
instruction length greater than sixteen bytes exception is detected by counting the number of prefix bytes before
an actual instruction is decoded and computing the length of the rest of the instruction when it is decoded. If
the sum of the prefix bytes and the remaining instruction length is greater than sixteen bytes, an error is
recognized.

The nonlockable instruction with a LOCK prefix exception, having a priority below the instruction
length exception, is recognized when the macroinstruction decoder 230 attempts to successfully decode an
instruction having a LOCK prefix, in which the instruction is not lockable.
The nonlockable LOCK instruction exception is detected based on decode of the opcode byte and existence of a
0F prefix. The nonlockable LOCK instruction exception only occurs during vectoring decode cycles since the
LOCK prefix inhibits short and long decodes.

The floating point not available exception, having a next to lowest priority, is recognized when the
macroinstruction decoder 230 attempts to successfully decode a WAIT instruction or an ESC instruction that is
not a processor control ESC, and the reporting of a floating point error is pending. Macroinstruction decoder
230 detects the floating point not available exception based on decoding of an opcode and modr/m byte, in
addition to the existence of a 0F prefix.
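
The prioritization of simultaneously detected exception conditions, and the formation of an exception vector address from a fixed value plus low-order identifying bits, may be sketched as follows for illustration. The sketch assumes that the priority order follows the order in which the conditions are described above, with the instruction breakpoint highest and the floating point error lowest. The numeric codes, the base value `EXCEPTION_VECTOR_BASE` and all names are hypothetical and are not taken from the described embodiment.

```python
from enum import IntEnum


class DecodeException(IntEnum):
    """Lower value means higher priority (assumed ordering)."""
    INSTRUCTION_BREAKPOINT = 0
    PENDING_NMI = 1
    PENDING_INTR = 2
    CODE_SEGMENT_OVERRUN = 3
    INSTRUCTION_FETCH_PAGE_FAULT = 4
    INSTRUCTION_LENGTH_OVER_16 = 5
    NONLOCKABLE_LOCK = 6
    FP_NOT_AVAILABLE = 7
    FP_ERROR = 8


EXCEPTION_VECTOR_BASE = 0x1F00  # hypothetical fixed value, low bits zero


def select_exception(active):
    """Return the highest priority active exception, or None if none is active."""
    return min(active) if active else None


def exception_vector(exception):
    """Fixed base address with low-order bits identifying the exception."""
    return EXCEPTION_VECTOR_BASE | int(exception)


def instruction_too_long(prefix_bytes, remaining_length):
    """Length check: prefix count plus the rest of the instruction exceeds sixteen."""
    return prefix_bytes + remaining_length > 16
```

For example, when a pending INTR and a floating point error are both active, `select_exception` returns the pending INTR condition.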

During each decode cycle, the macroinstruction decoder 230 attempts to perform some form of
instruction decode of one or more instructions. Typically, the macroinstruction decoder 230 succeeds in
performing either one or multiple short decodes, one long decode or an instruction vectoring decode.
Occasionally no decode is successful for three types of conditions including detection of an active exception
condition, lack of a sufficient number of valid bytes in the instruction buffer 408, or the macroinstruction
decoder 230 does not advance due to an external reason.

When an active exception condition is detected, all forms of instruction decode are inhibited and, during
the second decode cycle after detection of the exception condition, an exception vectoring decode cycle is
forced, producing an invalid Op quad.

When an insufficient number of valid bytes are available in the instruction buffer 408, either no valid
bytes are held in the instruction buffer 408 or at least the first opcode is valid and one of the decoders decodes
the instruction but the decoded instruction length requires further valid bytes in the instruction buffer 408, not
all of which are currently available.

When an external reason prevents macroinstruction decoder 230 advancement, either the scheduler 260
is full and unable to accept an additional Op quad during a decode cycle or the scheduler 260 is currently
accepting emulation code Op quads so that the macroinstruction decoder 230 is inactive awaiting a return to
decoding.

In the latter two cases, the decode state of the macroinstruction decoder 230 is inhibited from
advancing and the macroinstruction decoder 230 simply retries the same decodes in the next decode cycle.
Control of macroinstruction decoder 230 inhibition is based on the generation of a set of decode valid signals
with a signal corresponding to each of the decoders. For each decoder there are multiple reasons which are
combined into decoder valid signals to determine whether that decoder is able to successfully perform a decode.
The decoder valid signals for all of the decoders are then monitored, in combination, to determine the type of
decode cycle to perform. The type of decode cycle is indicative of the particular decoder to perform the
decode. The external considerations are also appraised to determine whether the selected decode cycle type is
to succeed. Signals indicative of the selected type of decode cycle select between various signals internal to the
macroinstruction decoder 230 generated by the different decoders, such as alternative next decode PC values,
and also are applied to control an Op quad multiplexer 444 which selects the input Op quad applied to the
scheduler 260 from the Op quads generated by the short decoders, the long decoder 416 and the vectoring
decoder 418.

In the case of vectoring decode cycles, the macroinstruction decoder 230 also generates signals that
initiate vectoring to an entry point in either internal emulation code ROM 232 or external emulation code RAM
236. The macroinstruction decoder 230 then monitors the active duration of emulation code fetching and
loading into the scheduler 260.
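
The selection of a decode cycle type from the decoder valid signals, followed by selection of the Op quad applied to the scheduler, may be sketched as follows for illustration. The preference of short over long over vectoring decodes is an assumption of the sketch, since the text does not state how the valid signals are combined; the enumeration and function names are likewise hypothetical.

```python
from enum import Enum


class DecodeCycle(Enum):
    NONE = 0
    SHORT = 1
    LONG = 2
    VECTORING = 3


def select_decode_cycle(short_valid, long_valid, vectoring_valid):
    """Combine the per-decoder valid signals into a decode cycle type.

    Assumed preference: short, then long, then vectoring.
    """
    if short_valid:
        return DecodeCycle.SHORT
    if long_valid:
        return DecodeCycle.LONG
    if vectoring_valid:
        return DecodeCycle.VECTORING
    return DecodeCycle.NONE


def decode_advances(cycle, scheduler_full, emulation_active):
    """External considerations decide whether the selected cycle succeeds."""
    return cycle is not DecodeCycle.NONE and not scheduler_full and not emulation_active


def op_quad_multiplexer(cycle, short_quad, long_quad, vectoring_quad):
    """Model of Op quad multiplexer 444: pick the Op quad for the selected cycle."""
    choices = {
        DecodeCycle.SHORT: short_quad,
        DecodeCycle.LONG: long_quad,
        DecodeCycle.VECTORING: vectoring_quad,
    }
    return choices.get(cycle)
```

When `decode_advances` returns false, the same decodes would simply be retried in the next decode cycle.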

The instruction decoder 220 includes the branch unit (not shown) for performing branch prediction so
that operations are speculatively executed. Performance of an out-of-order processor is enhanced when
branches are handled quickly and accurately so that pipeline-draining mispredictions are avoided. The processor
120 employs a two-level branch prediction algorithm that is disclosed in detail in U.S. Patent No. 5,454,117,
entitled CONFIGURABLE BRANCH PREDICTION FOR A PROCESSOR PERFORMING SPECULATIVE
EXECUTION (Puziol et al., issued September 26, 1995), U.S. Patent No. 5,327,547, entitled TWO-LEVEL
BRANCH PREDICTION CACHE (Stiles et al., issued July 5, 1994), U.S. Patent No. 5,163,140, entitled
TWO-LEVEL BRANCH PREDICTION CACHE (Stiles et al., issued November 10, 1992), and U.S. Patent
No. 5,093,778, entitled INTEGRATED SINGLE STRUCTURE BRANCH PREDICTION CACHE (Favor et
al., issued March 3, 1993). The processor 120 further utilizes an 8,192-entry branch history table (BHT) (not
shown) which is indexed by combining four program counter bits with nine bits of global branch history. Each
BHT entry contains two history bits. The BHT is a dual-port RAM allowing both a read/lookup access and a
write/update access. BHT lookups and updates do not conflict since they take place in opposite half phases of a
clock cycle. The large number of entries of the BHT is supplied in a reasonable integrated circuit area because
the BHT is only predicting conditional branch directions so that entries are not tagged and predicted branch
target addresses are not stored, except for a 16-entry return address stack (not shown). Accordingly, an access
to the BHT is similar to a direct mapping into a cache-like structure in which the BHT is indexed to access an
entry in the BHT and the accessed entry is presumed to be a branch instruction. For branches other than
returns, the target address is calculated during the decode cycle. The target address is calculated with sufficient
speed using a plurality of parallel adders (not shown) that calculate all possible target addresses before the
location of a branch instruction is known. By the end of the decode cycle, the branch unit 234 determines
which, if any, target address result is valid.

If a branch is predicted taken, the target address is immediately known and the target instructions are
fetched on the following cycle, causing a one-cycle taken-branch penalty. The taken-branch penalty is avoided
using a branch target buffer (BTB) 456. The BTB 456 includes sixteen entries, each entry having sixteen
instruction bytes with associated predecode bits. The BTB 456 is indexed by the branch address and is accessed
during the decode cycle. Instructions from the BTB 456 are sent to the instruction decoder 220, eliminating the
taken-branch penalty, for a cache hit of the BTB 456 when the BHT predicts a taken branch. During each
decode cycle, the linear decode PC 424 is used in a direct-mapped manner to address the BTB 456. If a hit,
which is realized before the end of the decode cycle, occurs with a BTB entry, a PC-relative conditional transfer
control instruction is decoded by a short decoder and the control transfer is predicted taken, then two actions
occur. First, the initial target linear fetch address directed to the instruction cache 214 is changed from the
actual target address to a value which points to an instruction byte immediately following the valid target bytes
contained in the BTB entry. This modified fetch address is contained in the BTB entry and directly accessed
from the BTB entry. Second, the instruction byte and predecode information from the entry is loaded into the
instruction buffer 408 at the end of the decode cycle. If a PC-relative conditional transfer control instruction is
decoded by a short decoder and the control transfer is predicted taken, but a miss occurs, then a new BTB entry
is created with the results of the target instruction fetch. Specifically, simultaneously with the first successful
load of target instruction bytes into the instruction buffer 408 from the instruction cache 214, the same
information is loaded into a chosen BTB entry, replacing the previous contents. The target fetch and instruction
buffer 408 load otherwise proceed normally.

Each entry includes a tag part and a data part. The data part holds sixteen extended instruction bytes
including a memory byte and three associated predecode bits. The correspondence of the memory byte is
memory-aligned with the corresponding instruction buffer 408 location. The tag part of a BTB entry holds a
30-bit tag including the 32-bit linear decode PC 424 associated with the transfer control instruction having a
cached target, less bits [4:1], an entry valid bit and the 30-bit modified initial target linear instruction fetch
address. No explicit instruction word valid bits are used since the distance between the true target address and
the modified target address directly implies the number and designation of valid instruction words within the
BTB 456.

The purpose of the BTB 456 is to capture branch targets within small to medium sized loops for the
time period a loop and nested loops are actively executed. In accordance with this purpose, at detection of the
slightest possibility of an inconsistency, the entire BTB is invalidated and flushed. The BTB 456 is invalidated
and flushed upon a miss of the instruction cache 214, any form of invalidation of instruction cache 214, an ITB
miss, or any form of ITB invalidation. Branch targets outside temporal or spatial locality are not effectively
cached. Typically, the BTB 456 contains only a small number of entries so that complexity is reduced while the
majority of performance benefit of ideal branch target caching is achieved.

PC-relative branch target address calculation logic performs the target address calculation. Branch
target address calculation logic is utilized only for PC-relative transfer control instructions that are decoded by a
short decoder SDec0 410, SDec1 412 or SDec2 414. Specifically, the branch target address calculation logic is
utilized for the short decode branch instructions including Jcc disp8, LOOP disp8, JMP disp8, JMP disp16/32,
and CALL disp16/32. Each short decoder SDec0 410, SDec1 412 and SDec2 414 includes logical and linear
branch target address calculation logic (not shown). All three sets of logical and linear branch target address
calculation logic function in parallel while the short decoders 410, 412 and 414 determine whether any of the
operations is a PC-relative short decode branch instruction. The logical and linear branch target address
calculation logic sum the logical program counter of the branch, the length of the branch instruction and the
sign-extended displacement of the branch instruction and conditionally mask the high-order 16 bits of the sum,
depending on calculation sizing, to produce a logical target address. The logical and linear branch target
address calculation logic sum the logical target address with the current code segment base address to produce a
linear target address. If the branch is taken, either unconditionally or predicted taken, then calculated addresses
corresponding to the decoded short decode branch instruction are used to reinitialize the logical decode PC 422
and linear decode PC 424. If the branch is predicted not taken, the logical address is saved with the associated
short decode branch instruction. The logical target address is
compared to the current code segment limit value to monitor for a limit violation.
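
The logical and linear target address arithmetic may be sketched as follows for illustration. The function and parameter names, the 32-bit wraparound and the treatment of a limit violation as a target greater than the limit are assumptions of the sketch.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def branch_targets(logical_pc, instr_length, displacement, disp_bits,
                   cs_base, sixteen_bit_sizing):
    """Return (logical_target, linear_target) for a PC-relative branch."""
    total = logical_pc + instr_length + sign_extend(displacement, disp_bits)
    logical_target = total & MASK32
    if sixteen_bit_sizing:
        # Conditionally mask the high-order 16 bits of the sum.
        logical_target &= 0xFFFF
    linear_target = (logical_target + cs_base) & MASK32
    return logical_target, linear_target


def limit_violation(logical_target, cs_limit):
    """Assumed check: the target lies beyond the code segment limit."""
    return logical_target > cs_limit
```

In the described arrangement three such calculations proceed in parallel, one per short decoder, and the result is used only if the corresponding decoder identifies a PC-relative branch.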

If the logical and linear branch target address calculation logic detects a limit violation, whether the
branch is predicted taken or predicted not taken, then a special tag bit indicative of the limit violation is set in
the scheduler 260 Op quad entry holding the operations generated from the branch instruction. Subsequently,
when the operation commit unit (OCU) of the scheduler 260 attempts to commit the Op quad, the Op quad
is handled as containing a fault and aborted. The macroinstruction decoder 230 then generates signals that initiate
vectoring to a fault handler in emulation code ROM 232. The fault handler temporarily inhibits decoding by the
short and long decoders and jumps to the fault PC address of the violating instruction associated with the faulted
Op quad. Ultimately, the branch instruction is redecoded and vectored to instruction emulation code. The
emulation code recognizes the limit violation if the branch is actually taken and appropriately handles the
violation.

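The processor 120 generally responds to a fault condition by vectoring to a specific fault handler in the
emulation code ROM 232. The fault handler includes operations defined within the RISC86 instruction set
which perform a routine that determines the source of the fault and an appropriate response to the fault, and
initiates the appropriate response. As an alternative in appropriate cases, the processor 120 also includes a special
instruction called a "load alternate fault handler" instruction which initiates a special fault response. The load
alternate fault handler instruction passes through the instruction decoder 220 pipeline in the manner of all
instructions, but causes any subsequent exception condition to invoke a different entry
point. The alternate fault handler terminates upon completion of execution of the current macroinstruction. The
alternate fault handler is advantageous to allow special fault handling only in the case of a fault and to
avoid special setup work if the fault occurs.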

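One example of the advantage of the alternate fault handler arises with respect to a repeated move
string instruction (REP MOVs). To perform multiple iterations very quickly, a capability to reorder the
sequence of operations is important. The sequence of operations typically includes a load from a first pointer-
designated address, a store to a second pointer-designated address and, if both the load and store are successful,
an increment of the first and second pointers and a decrement of a counter. For additional efficiency, the
pointers are incremented and the counter decremented before completing the store operation. However, if a
fault or exception occurs during the sequence of operations, the sequence is aborted with the counters and
pointers in the wrong state. For example, the architectural x86 SI, DI and CX registers do not have the correct
value. The alternate fault handler is used by specifying, prior to the sequence, an alternate fault handler that
performs a cleanup after a repeated move fault. The sequence proceeds without the overhead of intermediate
instructions for tracking the pointers and counters. If no errors occur, the alternate fault handler terminates
without effect. However, if an error occurs, the alternate fault handler is invoked. This alternate fault handler
is specific to the particular sequence of operations and performs cleanup accordingly and jumps to the default
fault handler. Advantageously, highly efficient code improves speed performance without hindrance from the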
alternate fault handler until an error arises, at which time the error is addressed.

[...] If the branch is actually taken, the state machine is incremented unless already at the maximum
value (3). If the branch is actually not taken, the state machine is decremented unless already at a minimum
value (0). Accordingly, a state machine value of 0 and 1 respectively indicate a strong and a mild prediction of
a branch not taken. A state machine value of 2 and 3 respectively indicate a mild and a strong prediction of a
branch taken. To support updating of BHT entries, a copy of the branch address and branch history bits for
accessing the BHT and a copy of the BHT state machine bits are transferred to the scheduler 260 along with the
branch condition (BRCOND) operation. Since a maximum of one BRCOND is included in an Op quad, the BHT
support bits are tagged to the Op quad applied to the scheduler 260. The BHT does not contain entry tags
(addresses of linear decode PC 424 associated with decoded conditional branches) that are typical in cache
structures.

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The information saved in a scheduler 260 Op quad along with an associated BRCOND operation has a 
width of fifteen bits including four branch address bits, nine current history bits, and the immediately reused
two state machine bits, the upper bit of which is also the predicted direction for the immediate branch. The
first thirteen bits are used, when necessary, to reaccess the BHT and to update a state machine value. The final
two bits are modified to create the new state machine value.

When a branch is mispredicted, the set of history values in the nine-bit branch history shift register are
corrected to reflect the actual direction taken by the branch. Furthermore, the shift register is "shifted back" to
correspond to the mispredicted branch, then updated based on the actual branch direction and which branch
direction was predicted.
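
The BHT indexing, two-bit state machine and history correction may be sketched as follows for illustration. Several points are assumptions of the sketch rather than statements of the embodiment: the four program counter bits are taken to be the low-order four bits, the index is formed by concatenating them above the nine history bits, the state machine is modeled as a conventional two-bit saturating counter whose upper bit gives the predicted direction, and the history register shifts left with a taken branch recorded as 1. All names are hypothetical.

```python
HISTORY_BITS = 9
HISTORY_MASK = (1 << HISTORY_BITS) - 1
BHT_ENTRIES = 1 << (4 + HISTORY_BITS)  # 8,192 entries


def bht_index(branch_pc, history):
    """Thirteen-bit index: four PC bits concatenated with nine history bits."""
    return ((branch_pc & 0xF) << HISTORY_BITS) | (history & HISTORY_MASK)


def predict_taken(state):
    """The upper bit of the two-bit state is the predicted direction."""
    return bool(state & 0b10)


def update_state(state, taken):
    """Saturating up/down counter over the values 0 through 3."""
    if taken:
        return min(state + 1, 3)
    return max(state - 1, 0)


def shift_history(history, taken):
    """Shift the branch outcome into the nine-bit history register."""
    return ((history << 1) | int(taken)) & HISTORY_MASK


def repair_history(history_at_prediction, actually_taken):
    """On a misprediction, restart from the history saved with the branch
    and shift in the actual direction."""
    return shift_history(history_at_prediction, actually_taken)
```

Saving the four address bits, nine history bits and two state bits with the BRCOND operation supplies everything `bht_index`, `update_state` and `repair_history` need when the branch resolves.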

A return address stack (RAS) (not shown) is a target address cache for return (RET) transfer control
instructions. RAS is an eight entry, 32-bit wide, single-ported RAM that is managed as a circular buffer using
a single three-bit pointer. During each cycle at most one access, either a read access for a RET decode or a
write access for a CALL decode, is performed. RAS caches RET return addresses and predicts return
addresses which inherently specify a target address indirectly, in contrast to other transfer control instructions
that contain a direct specification of target address. RAS is advantageously utilized since a particular RET
instruction often changes target address between different executions of the instruction. RAS discovers and
anticipates the target address value for each RET instruction execution through monitoring of the return
addresses that are saved - pushed on a stack - by CALL instructions. Corresponding CALL and RET
instructions typically occur dynamically in pairs and in last-in-first-out (LIFO) order with respect to other CALL
and RET instruction pairs.

Each time a CALL instruction is successfully decoded, the logical return address of the CALL
instruction is saved (pushed) to a circular buffer managed as a LIFO stack. Each time a RET instruction is
successfully decoded, the return address value currently on the top of the RAS is employed as the predicted
target address for the RET and the value is popped from the RAS. RAS achieves a high prediction rate
although mispredictions do occur because CALLs and RETs do not always occur in nested pairs, only near
CALLs and RETs and not far CALLs and RETs are supported, and mispredictions occur because of the finite
depth of the RAS. When a conditional branch misprediction occurs, RAS attempts to restore the state prior to
misprediction by setting the top of stack pointer to the previous condition because CALL and RET instructions
may have been speculatively decoded and the top-of-stack pointer thereby modified. The original pointer,
before the misprediction, is to be restored. Restoration following a misprediction is supported by the scheduler
260. Each scheduler 260 Op quad is tagged with a current initial top-of-stack pointer value in effect during the
decode cycle in which the Op quad was generated. When the BRCOND Op generated for a conditional branch
instruction is resolved and found to be mispredicted, the top-of-stack pointer tagged to the scheduler Op quad is
supplied to the RAS during a restart cycle that is generated by the scheduler 260. RAS replaces the current
top-of-stack value with the scheduler Op quad top-of-stack pointer tag.
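
The circular-buffer management of the RAS, including pointer restoration after a misprediction, may be sketched as follows for illustration. The class and method names are hypothetical, and the choice to increment the pointer before a write and decrement it after a read is an assumption of the sketch.

```python
class ReturnAddressStack:
    """Eight-entry, 32-bit circular buffer with a three-bit top pointer."""

    ENTRIES = 8

    def __init__(self):
        self.ram = [0] * self.ENTRIES
        self.top = 0  # three-bit top-of-stack pointer

    def push(self, return_address):
        """CALL decode: save the logical return address."""
        self.top = (self.top + 1) & 0x7
        self.ram[self.top] = return_address & 0xFFFFFFFF

    def pop(self):
        """RET decode: predict the target from the top entry and pop it."""
        predicted = self.ram[self.top]
        self.top = (self.top - 1) & 0x7
        return predicted

    def restore(self, tagged_top):
        """Restart cycle: reload the pointer tagged to the Op quad."""
        self.top = tagged_top & 0x7
```

Only the pointer is restored; entries overwritten by speculatively decoded CALL instructions are not recovered, which is one source of the residual mispredictions noted above.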

Referring to Figure 5, a schematic block diagram depicts an instruction decoder emulation circuit 231
including an instruction register 512, an entry point circuit 514, an emulation code sequencer 510, an emulation
code memory 520 and an Op substitution circuit 522. The instruction decoder emulation circuit 500 is a circuit
within the instruction decoder 220. The instruction decoder emulation circuit 231 receives instruction bytes and
associated predecode information from the instruction buffer 408 connected to the instruction fetch control
circuit 218, the BTB 456 or the instruction cache 214. The instruction buffer 408 is connected to and supplies
the instruction register 512 with x86 instructions. The instruction register 512 is connected to the entry point
circuit 514 to supply emulation code ROM entry points. The entry point circuit 514 receives the x86 instruction
and, from the x86 instruction operation code (opcode), generates an entry point address, a beginning address
pointing into the emulation code memory 520. In this manner an address of an instruction in emulation code
memory 520 is synthesized from the opcode of an x86 instruction. The address is derived based on the x86
instruction byte, particularly the first and second bytes of the x86 instruction as well as information such as the
modr/m byte, prefixes REP and REPE, the protected mode bit and effective data size bit DSz. Generally,
closely related x86 instructions have similarly coded bit fields, for example a bit field indicative of instruction
type is the same among related x86 instructions, so that a single entry in the emulation code memory 520
corresponds to several x86 instructions. Entry points are generally synthesized by reading the x86 instructions
and assigning bits of the entry point address according to the values of particular x86 instruction bit fields. The
instruction register 512 is connected to the emulation code sequencer 510 which, in turn, is connected to the
emulation code memory 520. The emulation code sequencer 510 applies the entry point to the emulation code
memory 520 and receives sequencing information from the emulation code memory 520. The emulation code
sequencer 510 either controls the sequencing of instructions or, when a new sequence is to be started, applies an
entry point to the emulation code memory 520. Operations (Ops) encoded in the emulation code memory 520
are output by the emulation code memory 520 to the Op substitution circuit as Op quads or Op units. The Ops
correspond to a template for RISC-type x86 operation. This template includes a plurality of fields into which
codes are selectively substituted. The emulation code memory 520 is connected to the Op substitution circuit
522 to supply Ops into which the various Op fields are selectively substituted. Functionally, the entry point
circuit 514 calculates an entry point into the emulation code ROM 232 or emulation code RAM 236. The
sequence in emulation code ROM 232 determines the functionality of an instruction.

The emulation code memory 520 includes an on-chip emulation code ROM 232 and an external
emulation code RAM 236. The emulation code memory 520 includes encoded operations that direct how the
processor 120 functions and define how x86 instructions are executed. Both the emulation code ROM 232 and
RAM 236 include a plurality of operation (Op) instruction encodings having an Op coding format that is the same
in ROM 232 and RAM 236. For example, in one embodiment the emulation code ROM 232 has a capacity of
4K 64-bit words. The Op coding format is typically a format defined in 30 to 40 bits, for example. In one
embodiment, a 38-bit format, shown in Figures 6A through 6E, is defined. The emulation code ROM 232 base
address location within the emulation space is fixed. The external emulation code RAM 236 is resident in
standard memory address space within cacheable memory. The emulation code RAM 236 base address location
within the emulation space is fixed. The 32-bit emulation code RAM 236 address is formed by the fixed base
address of the emulation code RAM 236, which supplies the most significant fifteen bits <31:17>, and the
Op address, which furnishes fourteen bits <16:3> concatenated to the base address bits. The two least
significant bits <1:0> of the emulation code RAM address are set to zero. The fourteen-bit Op address in
emulation code RAM 236 is the same as the Op address in emulation code ROM 232. Operations (Ops) are
stored in Op coding format, for example 38 bits, in the external emulation code RAM 236 in 64-bit words. Bits
in excess of the Op coding format bits of the 64-bit words are used to store control transfer (OpSeq)
information. The external emulation code RAM 236 is typically used for test and debug purposes, allowing for
patching of any instruction encoded in the emulation code ROM 232, and for implementing special functions
such as system management mode (SMM). For example, if an instruction in emulation code ROM 232 is found
to function improperly, the external emulation code RAM 236 is accessed to temporarily or permanently
substitute for the improperly-functioning fixed code in the emulation code ROM 232. Access to the external
emulation code RAM 236 is typically gained using one of two techniques. In a first technique, a one-bit field in
an OpSeq field of an element of emulation code memory 520 designates that the next address for fetching
instructions is located in external emulation code RAM 236. In this first technique, the on-chip emulation code
ROM 232 initiates execution of the external emulation code RAM 236. In a second technique, a vector address
is simply supplied for vectoring to an entry point in the emulation code RAM 236.
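
The address formation described above can be illustrated with a minimal Python sketch. The base address value and the function name are assumptions chosen only for illustration. The sketch places the 14-bit Op address in bits <16:3> and leaves the three bits below it at zero, which matches the 64-bit word alignment described later in the text, although the paragraph above mentions only bits <1:0>.

```python
# Hypothetical base address; only bits <31:17> of it are used.
EMC_RAM_BASE = 0xFFFE0000


def emcode_ram_address(op_addr: int, base: int = EMC_RAM_BASE) -> int:
    """Form the 32-bit byte address of a 64-bit emulation code RAM word."""
    if not 0 <= op_addr < (1 << 14):
        raise ValueError("Op address must fit in 14 bits")
    high = base & 0xFFFE0000      # bits <31:17> come from the fixed base address
    return high | (op_addr << 3)  # bits <16:3> hold the Op address; low bits stay zero
```

Under these assumptions, consecutive Op addresses map to byte addresses eight bytes apart, one 64-bit word per Op.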

The instruction cache 214, instruction fetch control circuit 218 and instruction decoder 220 function in
three instruction fetch and decode modes. In a first mode, the instruction decoder 220 fetches emulation code
Op quads from the on-chip emulation code ROM 232. Each Op quad includes four operations (Ops) plus
control transfer information (OpSeq) for determining the next cycle of fetch and decode function. In a second
mode, the instruction fetch control circuit 218 controls fetching of x86 macroinstruction bytes from the
instruction cache 214, which is part of the on-chip L1 instruction cache 214. The x86 macroinstructions are
decoded by the macroinstruction decoder 230, which generates four operations (Ops). Four Ops plus the OpSeq
field form a full Op quad. The instruction decoder 220 performs any coded control transfers using the branch
unit 234 and vectoring functionality of the macroinstruction decoder 230. In a third mode, the instruction fetch
control circuit 218 controls fetching of 64-bit words containing emulation code in Op coding format from the
instruction cache 214, one 64-bit word per cycle. Each 64-bit word corresponds to a single operation (Op). In
other embodiments, a plurality of 64-bit words may be accessed per cycle. In an embodiment in which four
64-bit words are accessed, the emulation code RAM 236 supplies a full Op quad in the manner of the on-chip
emulation code ROM 232 so that a fully-reprogrammable processor with full efficiency is achieved. A
fully-reprogrammable processor advantageously permits soft implementation of greatly differing processors, for
example an x86 processor and a PowerPC, in a single hardware.

In the first and third operating modes, control transfer information is formatted into an operation
sequencing (Opseq) field of the Op quad. Unconditional control transfers, such as branch (BR) and return from
emulation (ERET) operations, are controlled completely using the Opseq control transfer information.
Conditional transfers, such as branch on condition (BRcc), are controlled using a combination of the Opseq field
and a branch condition (BRCOND) operation. An OpSeq field format graphic is shown in Figure 7. The
16-bit OpSeq field 700 includes a two-bit sequencing action (ACT) field 710, a single-bit external emcode field 712
and a 13-bit operation (Op) address field 714. Four sequencing actions in the ACT field 710 are encoded, as
follows:
ACT   Operation      Description
00    BR/BRcc        uncond/cond branch to emcode address
01    BSR/BSRcc      uncond/cond "call" to emcode address
10    ERET/ERETcc    return from emulation to inst decoding
11    SRET/SRETcc    return from emulation call

Whether a sequencing action of an OpSeq is conditional or unconditional depends on the presence or
absence, respectively, of a branch condition (BRCOND) Op elsewhere within the Op quad. The BRCOND Op
within the Op quad specifies the condition to be tested and the alternate emulation code target address. No
explicit static branch direction prediction bit exists. Instead the predicted action and next address are always
specified by the OpSeq field 700 and the "not predicted" next address is always specified by the BRCOND Op.
A BRCOND Op is always paired with a BSR sequencing action, including unconditional calls. For
unconditional and conditional "predicted-taken" calls, the BRCOND Op specifies the return address to be saved.
The external emcode field 712 is set to one if emulation code to be executed is located in external
emulation code RAM 236. The external emcode field 712 is set to zero if emulation code to be executed is
located in internal emulation code ROM 232. The Op address field 714 designates an address of a target Op
within a non-entry point Op quad.
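
As a rough illustration of the OpSeq field 700, the following Python sketch unpacks a 16-bit value into its three fields. The text gives the field widths but not their bit positions (those are in Figure 7), so the layout used here, ACT in bits [15:14], external emcode in bit [13] and Op address in bits [12:0], is an assumption, as are the names `OpSeq` and `decode_opseq`. The action names follow the ACT table above.

```python
from typing import NamedTuple

# Sequencing actions keyed by the two-bit ACT value.
ACTIONS = {
    0b00: "BR/BRcc",
    0b01: "BSR/BSRcc",
    0b10: "ERET/ERETcc",
    0b11: "SRET/SRETcc",
}


class OpSeq(NamedTuple):
    action: str      # sequencing action selected by ACT
    external: bool   # True when the target is in external emulation code RAM
    op_addr: int     # 13-bit Op address


def decode_opseq(word: int) -> OpSeq:
    """Unpack a 16-bit OpSeq value using the assumed bit layout."""
    act = (word >> 14) & 0b11        # assumed position of ACT
    ext = bool((word >> 13) & 1)     # assumed position of the external emcode bit
    addr = word & 0x1FFF             # assumed position of the Op address
    return OpSeq(ACTIONS[act], ext, addr)
```

For instance, under this assumed layout the value 0x6005 would decode as a BSR/BSRcc action targeting Op address 5 in external emulation code RAM.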

The Opseq control transfer information controls unconditional control transfers when an Op quad or
64-bit memory word is fetched and arranged or "instantaneously decoded". Designation of the next instruction to
be decoded is controlled by the Opseq field alone. The Opseq field specifies one of three alternative actions.
First, the Opseq field directs fetching of emulation code from emulation code ROM 232 at a specified 14-bit
single operation word address so that an emulation code ROM 232 Op quad is fetched. Second, the Opseq field
directs fetching of emulation code from emulation code RAM 236 at a specified 14-bit single operation word
address so that an emulation code RAM 236 64-bit memory word is fetched. Third, the Opseq field includes a
return from emulation (ERET) directive, which directs the instruction decoder 230 to return to x86
macroinstruction decoding.

Emulation code fetched from the emulation code ROM 232 is fetched in the form of aligned Op quads. 
A branch to an intermediate location within an Op quad causes the preceding operations within the Op quad to 
be treated as invalid by fetching NOOPs in place of the preceding operations. 
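
The NOOP substitution for a branch into the middle of an Op quad can be sketched as follows. This is a minimal illustration that assumes the low two bits of the Op word address select the position within a quad and that the ROM is modeled as a list of four-element quads. Neither detail is stated in the text, and the names are hypothetical.

```python
NOOP = "NOOP"


def fetch_quad(rom: list, op_addr: int) -> list:
    """Fetch the aligned Op quad holding op_addr, invalidating earlier Ops."""
    # Assumption: the low two address bits index the Op within its quad.
    quad_index, first = divmod(op_addr, 4)
    quad = rom[quad_index]
    # Ops that precede the branch target are replaced by NOOPs.
    return [NOOP] * first + list(quad[first:])
```

A branch to the third Op of a quad would thus yield two NOOPs followed by the last two Ops of that quad.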

The byte memory addresses for fetching 64-bit memory words from emulation code RAM 236 are
created by concatenating a specified 14-bit operation address with three least significant bits set to zero, thereby
creating an 8-byte aligned address. The byte memory addresses for fetching 64-bit memory words are 8-byte
aligned, thus rendering memory Op indexing and fetch/decode advancement consistent and simple.

The Opseq control transfer information also controls designation of the immediate next instruction to be
decoded for conditional control transfers. The branch condition (BRCOND) operation specifies the condition
code to be tested and evaluated and specifies an alternative 14-bit emulation code fetch and decode address.
Thus, Opseq control transfer information for conditional control transfers effectively specifies the predicted path
of the conditional branch. The BRCOND address typically is either the 14-bit target Op word address or the
14-bit Op word address of the next "sequential" operation (Op). More generally, the BRCOND address may
specify a fully general two-way conditional branch. A conditional ERET operation is implemented by setting
the Opseq field to specify an ERET operation so that the conditional ERET is predicted taken. If the ERET
operation is subsequently found to be mispredicted, then the x86 macroinstruction stream directed by the ERET
is aborted and the sequential macroinstruction stream specified by the BRCOND operation is restarted.

BRCOND operations are loaded into the scheduler 260 in an unissued state. BRCOND operations are
evaluated in-order by the branch resolution unit of the scheduler 260. If the branch is properly predicted, the
branch is marked Completed. Otherwise, the BRCOND state is left unissued and triggers a branch abort signal
when detected by the Op commit unit.
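
The BRCOND handling just described can be modeled with a small Python sketch. It is only an illustration under stated assumptions: the class `Brcond`, the state strings and the function `resolve` are hypothetical names, and the abort is reduced to returning the alternate fetch address carried by the BRCOND Op.

```python
from dataclasses import dataclass


@dataclass
class Brcond:
    alternate_addr: int        # "not predicted" next address carried by the BRCOND Op
    state: str = "unissued"    # BRCOND Ops enter the scheduler in an unissued state


def resolve(brcond: Brcond, predicted_correctly: bool):
    """Return the restart address after a branch abort, or None if fetch continues."""
    if predicted_correctly:
        brcond.state = "completed"   # correctly predicted branches are marked Completed
        return None
    # A mispredicted BRCOND stays unissued; the Op commit unit detects it and
    # signals a branch abort, after which fetch restarts at the alternate address.
    return brcond.alternate_addr
```

The design point this illustrates is that the predicted path needs no evaluation at fetch time: only a misprediction produces a new address.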

The emulation code memory 520 supports a single-level (no nesting) subroutine functionality, in which
an Opseq field is set to specify alternatives for fetching emulation code. The alternatives are structured as a
typical two-way conditional branch, except that a 14-bit Op word address from a field of a
BRCOND Op within the Op quad or memory Op is loaded into a subroutine return address register. The
subroutine return address register stores the 14-bit Op word address plus a single bit which designates whether
the return address is located in emulation code ROM 232 or RAM 236. The condition code specified by the
BRCOND Op may be any alternative, including TRUE, so that both unconditional and conditional
(predicted-taken) subroutines may be specified. However, the BRCOND Op must be specified to avoid loading an
undefined value into the subroutine return address register.
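
A minimal Python sketch of the single-level subroutine mechanism follows. The class and method names are hypothetical, and the model keeps only what the text states: one return address register holding a 14-bit Op word address plus a ROM/RAM bit, loaded from the BRCOND Op on a call.

```python
class EmcodeSequencer:
    """Toy model of the single-level subroutine return address register."""

    def __init__(self):
        self.return_reg = None  # (14-bit Op word address, located_in_ram)

    def call(self, target_addr: int, brcond_addr: int, brcond_in_ram: bool = False) -> int:
        # On a BSR action the return address comes from the BRCOND Op,
        # and fetching continues at the OpSeq target address.
        self.return_reg = (brcond_addr & 0x3FFF, brcond_in_ram)
        return target_addr

    def sret(self):
        # Without a BRCOND Op the register would hold an undefined value,
        # so this model refuses to return in that case.
        if self.return_reg is None:
            raise RuntimeError("return address register was never loaded")
        return self.return_reg
```

Because there is only one register, a second call simply overwrites the saved address, which is why nesting is not supported.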

All emulation code subroutine support and return address register management is performed by the
emulation code sequencer 510 at the front of the pipeline. Thus return address register loading and usage is
fully synchronous with standard decoder timing so that no delays are introduced.

Two-Way Emulation Code Branching

The emulation code ROM 232 is storage for a plurality of sequences of operations (Ops). An operation
sequence begins at a defined entry point that is hard-coded into the emulation code ROM 232 and extends to a
return from emulation (ERET) Opseq directive that ends the operation sequence. The number of operations in a
sequence is typically variable as is appropriate for performing various different functions. Some simple x86
instructions have only a single Op entry in the emulation code ROM 232, although these instructions are fetched
with Op quad granularity. Other more complex x86 instructions use many component operations. The
emulation code ROM 232 storage is structured as a plurality of Op quads, programmed into a fixed ROM
address space. Each Op quad includes four RISC Op fields and one Opseq field. The operation sequences are
typically not aligned within the Op quads so that, absent some technique for branching to interspersed locations
in emulation code ROM 232, many ROM units in the emulation code ROM 232 are unusable, wasting valuable
integrated circuit space. Furthermore, because the entry point address of an instruction in emulation code ROM
232 is synthesized from the opcode of an x86 instruction, the entry point addresses often are forced into fixed
positions spread throughout the ROM address space with intervening gaps that lead to unused portions of ROM.
ROM positions that are left without access via an entry point are free for other usage but are not conveniently
sequential to allow access. The OpSeq field provides a technique for branching to these interspersed locations,
thereby substantially eliminating wasted space.
Each of the four RISC Op fields of an Op quad stores a simple, RISC-like operation. The OpSeq field
stores a control code that is communicated to the emulation code sequencer 510 and directs the emulation code
sequencer 510 to branch to a next location in the emulation code ROM 232. Each of the four RISC Op fields
in the emulation code ROM 232 may store a branch operation, either conditional or unconditional, and thereby
specify a target address so that a plurality of branches may be encoded in a single Op quad. In some
embodiments of the instruction emulation circuit 231, the Op quad is limited to having at most a single branch
operation to direct Op order in combination with the OpSeq field. The combination of a conditional branch Op
in one of the four RISC Op fields and the OpSeq field in an Op quad yields an Op quad with two possible target
or next addresses.

For an Op quad having a plurality of target addresses, the emulation code sequencer 510 directs the
sequence of operations by selecting a hard-coded, predicted target address. Thus, for an Op quad including a
conditional branch, the emulation code sequencer 510 selects a hard-coded OpSeq target address in preference
over the conditional branch. The unconditional branch is subsequently handled in accordance with the branch
prediction functionality of the processor 120 so that no additional branch handling overhead is incurred by
implementing two-way emulation code branching.

Emulation microcode may be written so that a BRCOND Op is positioned in one of the first three
positions of the Op quad. Thus, the Ops following the BRCOND Op within the Op quad are executed based on
the predicted direction rather than whether the branch is ultimately taken. If the branch is ultimately found to
be correctly predicted, all of the Ops of the Op quad and the Ops of subsequent Op quads are committed to the
scheduler 260. If the branch is ultimately found to be mispredicted, all of the Ops following the BRCOND Op
plus all subsequent Op quads are aborted. The emulation code is supplied to include the branch conditions, a
target address, a "sequential" address, and also the prediction for the branch condition.

For most instructions, the OpSeq field alone supplies the next fetch address, either a vector address or
an ERET return. This is advantageous for simplifying the control hardware, supplying a fast and simple control
logic that controls operation fetching without analysis of conditional branching or branch prediction. For
conditional branch instructions, the OpSeq field supplies a predicted branch address and a BRCOND operation
in the quad specifies a condition code to be evaluated and specifies the alternate fetch address in case of
misprediction. The emulation code handling advantageously achieves the flexibility of nonsequential Op quad
fetching in which the instruction control sequence is selected by the program with few constraints.
Accordingly, the OpSeq field is advantageously used to efficiently fit Op sequences into nonused locations in the
emulation code ROM 232. The emulation code handling also includes the flexibility of optional two-way
branching in the case of conditional branches and also in the case of a subroutine call which may be directed to
return to substantially any location rather than being constrained to return to the instruction following the calling
instruction. This usage of the OpSeq field to branch to a target address advantageously achieves unconditional
branching without incurring a time or cycle penalty.
101010    MOVE    MULEH    multiply high/low data
101011    MOVE    MULEL    multiply high/low data
101100    MOVE    DIV1     divide
101101    MOVE    DIV2     divide
101110    MOVE    DIVER    divide
101111    MOVE    DIVEQ    divide
11000x    SPEC    RDxxx    read special register
11001x    SPEC
11010x    SPEC
11011x    SPEC
111000    SPEC    WRDR     write descriptor
111001    SPEC    WRDL     write descriptor low
11101x    SPEC    WRxxx    write special register
111100    SPEC    CHKS     check selector
111101    SPEC    WRDH     write descriptor high
11111x    SPEC    WRIP     write instruction pointer

Several RegOp encodings, specifically xx01x encodings, specify condition code
dependence. The rotate and shift Ops are functionally equivalent except that different
statmod bits are asserted.

The extension field (EXT) 614 is used, in combination with bit <0> of the TYPE field
612, for MOVcc operations to specify a 5-bit condition code. The extension field (EXT) 614 is
also used for RDxxx/WRxxx operations to specify a 4-bit special register number. The set status
(SS) field 624, in combination with the EXT field 614, is used to designate the status flags that
are affected by an operation. For Ops in which the set status (SS) field 624 is set to 1,
indicating that this Op does modify flags, the extension field (EXT) 614 specifies four status
modification bits designating the groups of flags that are modified by the Op. For RDSEG
Ops, the EXT field 614 specifies a 4-bit segment (selector) register. For a WRFLG [illegible]
Op, the special register encoding matches the desired StatMod value for when the SS field is
set. The set status (SS) field 624 and the EXT field 614 are RegOp fields.

Condition codes are encoded in five bits. Bit <0> of the 5-bit condition code field
specifies whether the condition or its complement is to be tested for truth or assertion, for
example if cc<0> is equal to one then invert the condition. Bits <4:1> of the five-bit
condition code (CC) field specify the condition to be evaluated, as follows:

CC      Mnemonic    Condition
0000    True        1
0111    [not legible]
1000    [not legible]
1001    CF          CF
1010    ZF          ZF
1011    CvZF        CF+ZF
1100    SF          SF
1101    PF          PF
1110    SxOF        SF^OF
1111    SxOvZF      SF^OF+ZF


Bit cc<0> specifies whether the condition or the complement of the condition is evaluated for
truth.
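
The five-bit condition code can be illustrated with a short Python sketch covering only the table entries that are legible here. The function name and the dictionary-of-flags representation are assumptions, as is the reading of the SxOF entries as SF XOR OF. Bits <4:1> pick the base condition and bit <0> inverts it.

```python
# Base conditions selected by bits <4:1>; only a legible subset is modeled.
CONDITIONS = {
    0b0000: lambda f: True,                              # True
    0b1011: lambda f: f["CF"] or f["ZF"],                # CvZF
    0b1100: lambda f: f["SF"],                           # SF
    0b1101: lambda f: f["PF"],                           # PF
    0b1110: lambda f: f["SF"] != f["OF"],                # SxOF, read as SF XOR OF
    0b1111: lambda f: (f["SF"] != f["OF"]) or f["ZF"],   # SxOvZF
}


def evaluate_cc(cc: int, flags: dict) -> bool:
    """Evaluate a 5-bit condition code against boolean status flags."""
    base = bool(CONDITIONS[cc >> 1](flags))   # bits <4:1> select the condition
    return (not base) if (cc & 1) else base   # bit <0> set means invert
```

With ZF set, for instance, the CvZF condition holds and its complemented encoding does not.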

In the definitions, the character "~" indicates a logical NOT operation, "·" indicates
a logical AND operation, "+" specifies a logical OR operation, and "^" designates a logical
XOR operation. OF, SF, ZF, AF, PF, and CF are standard x86 status bits. EZF and ECF are
an emulation zero flag and an emulation carry flag, respectively, that emulation code uses in
sequences implementing x86 instructions when the architectural zero flag ZF and carry flag CF
are not changed. IP, DTF and SSTF are signals indicating an interrupt is pending, a debug trap
flag and a single step trap flag respectively.

Branch conditions STRZ and MSTRC are logically identical and used in implementing
x86 instructions such as a move string instruction MOVS. For these x86 instructions, emulation
code stores an index in a register and creates a loop that ends with a BRCOND. Each iteration
of the loop moves a block of data and decrements the index. Branch prediction initially
predicts that the BRCOND branches to the beginning of the loop. Condition MSTRC indicates
that branch evaluation logic is to signal the instruction decoder when the index reaches a
predefined point near completion of the x86 instruction. The decoder then changes the branch
prediction for the BRCOND that is loaded into the scheduler. Accordingly, a mispredicted
branch and associated abort is avoided when looping is complete. Processor efficiency is
improved in this manner.
Emulation Environment Substitution

The emulation code sequencer 510 controls various emulation environment substitutions to substantially
expand the number of encoded operations beyond the number of operation entries in the emulation code ROM
232. The emulation code sequencer 510 includes logic circuits that analyze specific, typically dedicated,
encodings of an Op field to determine when a substitution is to be made and which substitution is performed.
Most encodings directly specify a field value. Other encodings indirectly specify a field value through
emulation environment substitution. Usage of emulation mode substitution achieves encoding of CISC
functionality while substantially reducing the size of emulation code ROM 232, advantageously reducing the size
and cost of a processor integrated circuit. The Op fields, including the register field and some of the size fields,
directly specify a register or specify an indirect register specifier such as AX or T1. Similarly, a field may
select RegM which is the direct register specifier. The Op substitution logic analyzes the coding for the indirect
register specifier to subsequently define the register coding and substitute the register encoding with the current
register. The size fields select one byte, two bytes or four bytes, or select a D-size so that the current effective
data size replaces the original encoding. The symbols such as HReg represent particular encodings such as
HReg_T2 which is a symbolic representation of the encoding for T2. Illustrative symbols include HReg, HDSz,
HASz, HSegDescr, HSpecOpType and others.

Pseudo-RTL code, as follows, particularly describes the emulation environment substitution operation:

    // Emulation Environment Substitution
    // Decode;
    for (int i = 0; i < 4; i++) {
      switch (OpQuad.Ops(i).Type) {
      case HOpType_RegOp:
        if ((HReg_Reg & 0x1c) == (OpQuad.Ops(i).RegOp.Src1Reg & 0x1c)) {
          OpQuad.Ops(i).RegOp.Src1Reg = HReg(uint(DEC_EmReg)); }
        if ((HReg_RegM & 0x1c) == (OpQuad.Ops(i).RegOp.Src1Reg & 0x1c)) {
          OpQuad.Ops(i).RegOp.Src1Reg = HReg(uint(DEC_EmRegm)); }
        if (!OpQuad.Ops(i).RegOp.I) { // not immediate
          if ((HReg_Reg & 0x1c) == (OpQuad.Ops(i).RegOp.Imm8 & 0x1c))
            OpQuad.Ops(i).RegOp.Imm8 = DEC_EmReg;
          if ((HReg_RegM & 0x1c) == (OpQuad.Ops(i).RegOp.Imm8 & 0x1c))
            OpQuad.Ops(i).RegOp.Imm8 = DEC_EmRegm;
        } // not immediate
        if ((HReg_Reg & 0x1c) == (OpQuad.Ops(i).RegOp.DestReg & 0x1c)) {
          OpQuad.Ops(i).RegOp.DestReg = HReg(uint(DEC_EmReg)); }
        if ((HReg_RegM & 0x1c) == (OpQuad.Ops(i).RegOp.DestReg & 0x1c)) {
          OpQuad.Ops(i).RegOp.DestReg = HReg(uint(DEC_EmRegm)); }
        if (HDSz_D == OpQuad.Ops(i).RegOp.DSz) {
          OpQuad.Ops(i).RegOp.DSz = DEC_EmDSz ? HDSz_4B : HDSz_2B; }
        if (HDSz_A == OpQuad.Ops(i).RegOp.DSz) {
          OpQuad.Ops(i).RegOp.DSz = DEC_EmASz ? HDSz_4B : HDSz_2B; }
        if (HDSz_S == OpQuad.Ops(i).RegOp.DSz) {
          OpQuad.Ops(i).RegOp.DSz = RUX_B ? HDSz_4B : HDSz_2B; }
        break;
      case HOpType_LdStOp:
        if ((HLdStOpType_LDST == OpQuad.Ops(i).LdStOp.Type) &&
            DEC_EmLock) {
          OpQuad.Ops(i).LdStOp.Type = HLdStOpType_LDSTL; }
        if ((HReg_Reg & 0x1c) == (OpQuad.Ops(i).LdStOp.DataReg & 0x1c)) {
          OpQuad.Ops(i).LdStOp.DataReg = HReg(uint(DEC_EmReg)); }
        if ((HReg_RegM & 0x1c) == (OpQuad.Ops(i).LdStOp.DataReg & 0x1c)) {
          OpQuad.Ops(i).LdStOp.DataReg = HReg(uint(DEC_EmRegm)); }
        if (HDSz_D == OpQuad.Ops(i).LdStOp.DSz) {
          OpQuad.Ops(i).LdStOp.DSz = DEC_EmDSz ? HDSz_4B : HDSz_2B; }
        if (HASz_A == OpQuad.Ops(i).LdStOp.ASz) {
          OpQuad.Ops(i).LdStOp.ASz = DEC_EmASz ? HASz_4B : HASz_2B; }
        if (HASz_S == OpQuad.Ops(i).LdStOp.ASz) {
          OpQuad.Ops(i).LdStOp.ASz = RUX_B ? HASz_4B : HASz_2B; }
        if (HASz_D == OpQuad.Ops(i).LdStOp.ASz) {
          OpQuad.Ops(i).LdStOp.ASz = DEC_EmDSz ? HASz_4B : HASz_2B; }
        if (HSegDescr_OS == OpQuad.Ops(i).LdStOp.Seg) {
          OpQuad.Ops(i).LdStOp.Seg = HSegDescr(uint(DEC_EmOprSeg)); }
        break;
      case HOpType_SpecOp:
        if ((HReg_Reg & 0x1c) == (OpQuad.Ops(i).SpecOp.DestReg & 0x1c)) {
          OpQuad.Ops(i).SpecOp.DestReg = HReg(uint(DEC_EmReg)); }
        if ((HReg_RegM & 0x1c) == (OpQuad.Ops(i).SpecOp.DestReg & 0x1c)) {
          OpQuad.Ops(i).SpecOp.DestReg = HReg(uint(DEC_EmRegm)); }
        if (HDSz_D == OpQuad.Ops(i).SpecOp.DSz) {
          OpQuad.Ops(i).SpecOp.DSz = DEC_EmDSz ? HDSz_4B : HDSz_2B; }
        // LDKD Op;
        if (OpQuad.Ops(i).SpecOp.Type == HSpecOpType_LDKD) {
          HVector<17> Imm17 = OpQuad.Ops(i).SpecOp.Imm17;
          if (DEC_EmDSz) {
            Imm17(5,0) = (Imm17(4,0) << 1) | Imm17(0); }
          OpQuad.Ops(i).SpecOp.Imm17 = Imm17; }
        break; }
    }
    EDR_OpQuad = OpQuad;
    EDR_ERET = DEC_EXTEmc ? IC_ERET : (OpQuad.Action ==
               HOpSeq_ERET);

The emulation code sequencer 510 sequences to a next entry point in the emulation code ROM
232, producing an Op quad, including four operations (Ops) and an Opseq field. For each of
the four operations in an Op quad, various substitutions are made, the type of substitution
depending on the particular operation type of the five general operation types. The five
operation types include register operations (RegOps), load-store operations (LdStOps), load
immediate operations (LIMMOps), special operations (SpecOps) and floating point operations
(FpOps). Op formats for the five operation types are depicted in Figures 6A through 6E.
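
The register substitution step for a single field can be sketched in Python as follows. This is an illustration only: the numeric encodings chosen for the indirect specifiers Reg and RegM are hypothetical, since the text does not give them, and the function name is an assumption. The 0x1c mask comparison follows the pseudo-RTL above.

```python
# Hypothetical 5-bit encodings for the indirect specifiers Reg and RegM.
HREG_REG = 0x18
HREG_REGM = 0x1C


def substitute_reg(field: int, em_reg: int, em_regm: int) -> int:
    """Replace an indirect register specifier with the current register number."""
    # Only bits selected by the 0x1c mask take part in the comparison.
    if (field & 0x1C) == (HREG_REG & 0x1C):
        return em_reg     # current Reg value from the emulation environment
    if (field & 0x1C) == (HREG_REGM & 0x1C):
        return em_regm    # current RegM value from the emulation environment
    return field          # direct specifiers pass through unchanged
```

Direct specifiers in the range 0-15 never match either masked pattern under these assumed encodings, so they are left untouched.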

The Ops that are produced by the emulation code ROM 232 are generic in nature. In 
particular, the Ops do not correspond exactly to an x86 instruction. Instead, the Ops from the 
emulation code ROM 232 form a structure for an x86 instruction. Various bit fields within the 
Op template are substituted using a substitution function performed by the emulation code
sequencer 510. Simply described, the substitution function substitutes some bits of the Op with 
other selected bits. 

X86 instructions typically include an opcode followed by a modr/m byte. The modr/m
byte designates an indexing type or register number to be used in the instruction. The modr/m
byte has three fields including a 2-bit (MSB) mode field, a 3-bit (intermediate) reg field and a
3-bit (LSB) r/m field. The mode field combines with the r/m field to form 32 possible values
indicative of eight registers and 24 indexing modes. The reg field specifies either a register
number or three more bits of opcode information. The r/m field either specifies a register as
the location of an operand or is used in combination with the mode field to define registers and
indexing modes.
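
The three modr/m fields can be separated with simple shifts and masks, as in the following Python sketch. The function name is an assumption made for illustration.

```python
def split_modrm(byte: int):
    """Split a modr/m byte into its (mode, reg, r/m) fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("modr/m is a single byte")
    mode = byte >> 6            # 2-bit most significant field
    reg = (byte >> 3) & 0b111   # 3-bit intermediate field
    rm = byte & 0b111           # 3-bit least significant field
    return mode, reg, rm
```

The eight register cases correspond to mode 3 in the x86 encoding. The other three mode values, each combined with eight r/m values, give the 24 indexing modes.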
Figure 6A is a register operation (RegOp) field encoding graphic that illustrates various
fields in the RegOp format. In the RegOp field 610, most-significant bits 36 and 37 are cleared
to designate the operation as a RegOp. The RegOp field 610 also includes a 6-bit operation
type (TYPE) field 612 at bit locations [35:30], a 4-bit extension (EXT) field 614 at bit locations
[29:26], a RU1-only (R1) bit 616 at bit location [25], a 3-bit operation/data size (DSz) field 618
at bit locations [24:22], a 5-bit destination (DEST) general register field 620 at bit locations
[21:17] and a 5-bit source 1 (SRC1) general register field 622 at bit locations [16:12]. The
RegOp field 610 also includes a single-bit set status (SS) field 624 at bit location [9], a
single-bit source 2 (I) field 626 at bit location [8] and an 8-bit immediate data or general
register for source 2 operands (IMM8/SRC2) field 628 at bit locations [7:0]. The TYPE field
612 of RegOp encodings include, as follows:



TYPE code   Class   TYPE        Description
00000x      ALU     ADD/INC
00001x      ALU     MOV/OR      move/or (only or if DSz=1)
00010x      ALU     ADC         add with carry
00011x      ALU     SBB         subtract with borrow
001000      ALU     AND/EAND    logical and
001001      ALU     BAND        logical and
00101x      ALU     SUB/ESUB    subtract
00110x      ALU     EXOR/XOR    exclusive or
00111x      ALU     CMP
010000      SHIFT   SLL         logical shift left
010001      SHIFT   BLL         logical shift left
01001x      SHIFT   SRL         logical shift right
01010x      SHIFT   SLC/RLC     shift/rotate left, carry
01011x      SHIFT   SRC/RRC     shift/rotate right, borrow
01100x      SHIFT   SLA         arithmetic shift left
01101x      SHIFT   SRA         arithmetic shift right
01110x      SHIFT   SLD/RLD     shift/rotate left, double
01111x      SHIFT   SRD/RRD     shift/rotate right, double
10000x      MOVE    RDFLG       read flag
100010      MOVE    SEXT
100011      MOVE    ZEXT
10010x      MOVE    RDFLGS
10011x      MOVE    MOVcc
101000      MOVE    MUL1S       multiply signed/unsigned
101001      MOVE    MUL1U       multiply signed/unsigned
The EXT field 614 is used to update condition flags including six flags corresponding
to x86 flags and two emulation flags. The eight flags are divided into four groups, using one
status modification bit per group of flags. The EXT field 614 defines updating of the various
condition code flags substantially independent of the TYPE 612 specification, so that
functionally related flags are controlled and updated as a group. Updating of related flags as a
group advantageously conserves control logic. The EXT field 614 defines a set of bits which
determine the flags to be updated for a particular instruction. Decoupling of condition code
handling from operation type, using the independent TYPE 612 and set status (SS) field 624,
allows some operations to be defined which do not update the flags. Accordingly, for those
circumstances in which updating of condition flags is not necessary, it is highly advantageous to
disable flag updating to avoid unnecessary dependency on previous flag values.

The RU1-only field 616 is indicative of Ops that are issued to the first register unit 244
and not to the second register unit 246 so that the R1 field 616 is a bit for hard-encoding an
execution unit specification. Thus, the RU1-only field 616 indicates that a particular operation
is only executable on a specific execution unit, generally because only the specific execution
unit incorporates a function implemented on that unit. The set status (SS) field 624 modifies
status flags according to EXT field 614 settings. The I field 626 specifies whether the source 2
operand is immediate or a general register. The IMM8/SRC2 field 628 specifies a 5-bit general
register if the I field 626 is zero. The IMM8/SRC2 field 628 specifies a signed immediate
value which is extended to the size of operation specified by the DSz field size if the I field 626
is one.
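
The RegOp bit locations listed for Figure 6A can be exercised with a short Python sketch that extracts each field from a 38-bit value. The function name and the dictionary result are assumptions. Bits [11:10] are not described in the text and are ignored here.

```python
def decode_regop(op: int) -> dict:
    """Extract RegOp fields from a 38-bit Op using the listed bit locations."""
    if (op >> 36) & 0b11 != 0:
        raise ValueError("bits 37 and 36 must be cleared for a RegOp")
    fields = {
        "TYPE": (op >> 30) & 0x3F,   # bits [35:30]
        "EXT": (op >> 26) & 0xF,     # bits [29:26]
        "R1": (op >> 25) & 1,        # bit [25]
        "DSz": (op >> 22) & 0x7,     # bits [24:22]
        "DEST": (op >> 17) & 0x1F,   # bits [21:17]
        "SRC1": (op >> 12) & 0x1F,   # bits [16:12]
        "SS": (op >> 9) & 1,         # bit [9]
        "I": (op >> 8) & 1,          # bit [8]
    }
    low = op & 0xFF                  # bits [7:0], IMM8/SRC2
    if fields["I"]:
        # Signed 8-bit immediate; extension to the operation size is not modeled.
        fields["IMM8"] = low - 0x100 if low & 0x80 else low
    else:
        fields["SRC2"] = low & 0x1F  # 5-bit general register
    return fields
```

A value with the I bit set and 0xFE in the low byte would decode with an immediate of -2, while the same low byte with I cleared is read as a register number.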



In the case of register operation (RegOp) substitution, the operation (Op) from the Op
quad is a register operation (RegOp). The instruction register 512 holds instruction bytes that
are decoded by the vectoring decoder 418. During a vectoring instruction decode, the
vectoring decoder 418 generates an initial vectoring quad and an entry point address based on
the contents of the instruction register 512. At the same time, the vectoring decoder 418
initializes the emulation environment variables, also called the emulation environment register
516, from various information based on fields of the instruction register 512 and based on other
information. Information from the emulation environment register 516 is supplied to Op
substitution logic that performs the substitution operation.

For RegOp substitution, various substitutions are made into fields of the RegOp format
610 shown in Figure 6A. For the destination (DEST) 620 and source 1 (SRC1) 622 fields of a
RegOp operation, a five-bit register encoding specifies a register using direct register specifiers
(0-15) or indirect register specifiers (Reg and RegM). The direct register specifiers directly
designate one of sixteen registers. The indirect specifiers, Reg and RegM, are substituted at Op
decode time by the current register numbers (0-15) from the emulation environment register
516. Substitution takes place as the instruction decoder 230 decodes an instruction, vectors to
an emulation code sequence in emulation code ROM 232, and initializes the emulation environment
register 516 with various fields from the instruction and other information. During operation of
an emulation code sequence, the Ops in the sequence include various fields, such as indirect
register specifier fields and size data fields, which are substituted based on current values of the
emulation environment register 516. Op substitution logic determines whether the encodings for
a field are indicative of a substituted field. An encoding for a field may indicate that other
fields are substituted. For example, the coding of indirect specifiers Reg and RegM in the dest
620 and src1 622 fields is indicative that the DSz field 618 is also substituted.

With respect to the DSz field 618, the x86 instruction set instructions operate on 8 bit,
16 bit or 32 bit data, depending on the current default condition of the processor 120. The DSz
field 618 includes three bits which indicate a data size of one byte, two bytes or four bytes.
For instructions specifying substitution of data size, indirect size specifiers designate an A size,
D size or S size. A data size substitution is determined by a B-bit and a D-bit. The B-bit is
specified by the current stack-segment register (SS). The D-bit is specified by the code-segment
register (CS). For indirect size specifiers, S size is determined by the B-bit. The effective
address (A) size and the effective data (D) size are determined by the D-bit, possibly overridden
by address or data size override prefixes, and held in the emulation environment register 516.
In general, the indirect specifiers of A size, D size, and S size are substituted or replaced by the
absolute encoding for two bytes or four bytes. For example, if a D size is selected, the data size
is resolved into two bytes or four bytes based on the effective data size as specified by a bit in
the emulation environment register 516. Similarly, if an indirect size specifier encodes the A
size, the effective address size is specified by a bit in the emulation environment register 516 to
be either two bytes or four bytes. If the indirect specifier selects the S size, a B-bit determines
whether two bytes or four bytes are substituted.
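The resolution of indirect size specifiers can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the `EmulationEnv` class and its field names are hypothetical stand-ins for the emulation environment register 516, a set B-bit or D-bit is taken to mean four bytes (the usual x86 convention), and a size override prefix is modeled as toggling the default.

```python
from dataclasses import dataclass


@dataclass
class EmulationEnv:
    """Hypothetical stand-in for the size-related state held in the emulation environment."""
    b_bit: int                          # from the current stack-segment (SS) descriptor
    d_bit: int                          # from the current code-segment (CS) descriptor
    operand_size_override: bool = False
    address_size_override: bool = False


def effective_bytes(default_bit: int, overridden: bool) -> int:
    """Return 4 or 2 bytes; an override prefix flips the segment default (assumption)."""
    is_32bit = bool(default_bit) != overridden
    return 4 if is_32bit else 2


def resolve_size(spec: str, env: EmulationEnv) -> int:
    """Replace an indirect size specifier (A, D or S) by an absolute size in bytes."""
    if spec == "A":
        return effective_bytes(env.d_bit, env.address_size_override)
    if spec == "D":
        return effective_bytes(env.d_bit, env.operand_size_override)
    if spec == "S":
        return 4 if env.b_bit else 2
    raise ValueError(f"unknown indirect size specifier: {spec!r}")
```

The S size depends only on the B-bit, whereas the A and D sizes start from the D-bit and are then adjusted by the corresponding prefix.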

Substitution of the Imm8/Src2 field 628 is performed only if source 2 (src2) operand is
an indirect register specifier, rather than an immediate value. Op substitution logic determines
whether the encoding for the immediate (I) field 626 is indicative of a substitution in the
Imm8/Src2 field 628 by accessing the I field bit 626 and, if an indirect register specifier is
selected, substituting a five-bit general register designation into the imm8/src2 field 628 of the
RegOp format 610. For a register-stored src2 operand using the memory-indexed addressing
form, no substitution takes place and the imm8/src2 field 628 of the RegOp format 610 is
loaded with an index value. The RegOp format 610 is a RISC-type three (two source and one
destination) operand format including the src1 field 622, the imm8/src2 field 628 and the dest
field 620. A standard x86 format is a two operand format. The I field 626 advantageously
allows the source 2 operand to flexibly take the form of either an immediate value or a general
register.

The RegOp field 610 defines a group of specific Ops, including signed multiply
(MUL1S), unsigned multiply (MUL1U), multiply high data (MULEH) and multiply low data
(MULEL) Ops, that execute parts of multiplication and unloading operations so that a multiply
instruction is made up of a plurality of these specific multiplication Ops. A division operation
is similarly executed by combination of a plurality of specific simple division Ops, including
one and two bit divide (DIV1), step divide (DIV2), divide unload remainder (DIVER) and
divide unload quotient (DIVEQ) Ops, for example performing a two-bit iteration on a divide.

The WRIP Op writes the x86 program counter (PC) to change the execution address
when the WRIP Op is retired, restarting instruction fetching at a desired address. The WRIP
Op is particularly advantageous in the case of various instructions for which decoding of the
instruction length is difficult or impossible so that, to avoid logic complexity, the implemented
logic does not correctly decode the length. The instruction is also advantageous for serializing
instruction fetch operations and for emulating branches. Efficiency is achieved by allowing the
logic to incorrectly decode the instruction length and to set the program counter using the WRIP
Op so that the incorrectly decoded instruction length is ignored.

The check selector (CHKS) Op is used for handling x86 instruction set segment
descriptors and selectors. The CHKS Op initiates a process of loading a segment register,
including the operations of checking for a null selector and generating an address offset into a
descriptor table.

Several write descriptor Ops (WRDH, WRDL, WRDR) are defined to perform real
(WRDR) and protected mode (WRDH/WRDL) segment register loads, as are known in the
computing arts and most particularly with respect to conventional x86 operations. The write
descriptor real (WRDR) Op loads the segment register in a real mode fashion. The write
descriptor protected mode high and low (WRDH/WRDL) Ops perform a sequence of checking
operations in the protected mode. In a conventional x86 architecture, access of a task to
instruction code and data segments and the I/O address space is automatically checked by
processor hardware, typically operating as a state machine. In the processor disclosed herein,
the conventional protected mode checks are instead performed as a sequence of operations as an
emulation code sequence of hardware primitives.

Figure 6B is a load-store operation (LdStOp) field encoding graphic that illustrates
various fields in the LdStOp format. In the LdStOp field 630, most-significant bits 36 and 37
are respectively set to one and zero to designate the operation as a LdStOp. The LdStOp field
630 also includes a 4-bit operation type (TYPE) field 632 at bit locations [35:32], and a two-bit
index scale factor (ISF) field 634 at bit locations [31:30] designating factors 1x, 2x, 4x and 8x.
The LdStOp field 630 includes a 4-bit segment register (SEG) field 636 at bit locations [29:26]
and a two-bit address calculation size (ASz) field 638 at bit locations [25:24] designating a
selection between A size, S size, D size and four bytes. The effective data and address sizes
are substituted for LdStOps in the same manner as RegOp substitution. The LdStOp field 630
includes a 2-bit data size (DSz) field 640 at bit locations [23:22] specifying sizes (1, 2, 4 and
DSize bytes) for integers and sizes (2, 4 and 8 bytes) for floating point, a 5-bit data
source/destination (DATA) general register field 642 at bit locations [21:17] and a single-bit
large displacement (LD) bit 644 at bit location [16] specifying a large displacement using the
Disp8 displacement from a preceding Op. The LD bit 644 is useful since Ops are only 38 bits
wide, an instruction format insufficient to specify a full 32-bit displacement into an operation.
Only displacements encoded in eight bits are possible for a single LdStOp field 630. The LD
bit 644, when asserted, indicates that the immediately preceding Op supplies a full 32-bit
displacement. The LdStOp field 630 includes a 4-bit base (BASE) general register field 646 at
bit locations [15:12]. The LdStOp field 630 also includes an 8-bit signed displacement (DISP8)
field 648 at bit locations [11:4] and a 4-bit index (INDEX) general register 649 at bit locations
[3:0]. The TYPE field 632 encodings of the LdStOp are as follows:



TYPE code   TYPE    Description
0000        LD      load integer data
0001        LDF     load floating point data
0010        LDST    load integer data with store check
0011        LDM     load multimedia data
0100        CDAF    CDA plus flush cache line
0101
0110        LDSTL   load int. w/store check, locked
0111        LDMST   load mm. w/store check
1000        ST      store integer data
1001        STF     store floating point data
1010        STUPD   store int. w/base register update
1011        STM     store multimedia data
1100        CDA     check data effective address
1101        CIA     check instruction effective address
1110        TIA     TLB invalidate address
1111        LEA     load effective address
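The LdStOp bit layout described above lends itself to a small field extractor. The following Python sketch is illustrative only: the function names are hypothetical, the format check follows a literal reading of "bits 36 and 37 are respectively set to one and zero", and the mapping of ISF values 0 to 3 onto factors 1x, 2x, 4x and 8x in that order is an assumption.

```python
# (high bit, low bit) of each LdStOp field, taken from the bit locations in the text.
LDSTOP_FIELDS = {
    "TYPE": (35, 32), "ISF": (31, 30), "SEG": (29, 26), "ASZ": (25, 24),
    "DSZ": (23, 22), "DATA": (21, 17), "LD": (16, 16), "BASE": (15, 12),
    "DISP8": (11, 4), "INDEX": (3, 0),
}


def bits(word: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_ldstop(op: int) -> dict:
    """Split a 38-bit LdStOp into its named fields."""
    if bits(op, 36, 36) != 1 or bits(op, 37, 37) != 0:
        raise ValueError("not a LdStOp: bit 36 must be one and bit 37 zero")
    fields = {name: bits(op, hi, lo) for name, (hi, lo) in LDSTOP_FIELDS.items()}
    # DISP8 is a signed 8-bit displacement.
    if fields["DISP8"] & 0x80:
        fields["DISP8"] -= 0x100
    # Assumed ordering: ISF 0, 1, 2, 3 selects 1x, 2x, 4x, 8x.
    fields["SCALE"] = 1 << fields["ISF"]
    return fields
```

The field widths sum to 36 bits, which together with the two format bits accounts for the full 38-bit Op.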

In the case of load-store operation (LdStOp) substitution, the emulation code sequencer
510 first determines whether a LOCK prefix has been acknowledged during setup of the
emulation environment. If the designated LdStOp operation is a load integer with store check
(LDST) and the LOCK prefix is acknowledged, the emulation code sequencer 510 substitutes a
load integer with store check, locked (LDSTL) opcode for the LDST opcode.

For LdStOp substitution, various substitutions are made into fields of the LdStOp
format 630 shown in Figure 6B. For the data register (DataReg) field 642 of a LdStOp
operation, a five-bit register encoding specifies a register using direct register specifiers (0-15)
or indirect register specifiers (Reg and RegM). The direct register specifiers directly designate
one of sixteen registers. The indirect specifiers, Reg and RegM, are substituted at Op decode
time by the current register numbers (0-15) from the emulation environment register 516.
Substitution takes place as the instruction decoder 230 decodes an instruction, vectors to an
emulation code sequence in emulation code ROM 232, and initializes the emulation environment
register 516 with various fields from the instruction and other information. During operation of
an emulation code sequence, the Ops in the sequence include various fields, such as indirect
register specifier fields and size data fields, which are substituted based on current values of the
emulation environment register 516. Op substitution logic determines whether the encodings for
a field are indicative of a substituted field. An encoding for a field may indicate that other
fields are substituted. The particular substitution depends on whether the data register
(DataReg) 642 is addressed using register addressing or memory-indexed addressing. If the
register addressing form is designated, DataReg field 642 of the LdStOp format 630 is
determined by indirect specifier Reg. If memory-indexed addressing is designated, DataReg
field 642 of the LdStOp format 630 is determined by indirect specifier RegM.

With respect to the ASz field 638 and the DSz field 640, the x86 instruction set
instructions operate on 8 bit, 16 bit or 32 bit data, depending on the current default condition of
the processor 120. The ASz field 638 and DSz field 640 each include two bits which indicate a
data size of one byte, two bytes or four bytes. For instructions specifying substitution of data
size, indirect size specifiers designate an A size, D size or S size. A data size substitution is
determined by a B-bit and a D-bit. The B-bit is specified by the current stack-segment register
(SS). The D-bit is specified by the code-segment register (CS). For indirect size specifiers, S
size is determined by the B-bit. The effective address (A) size and the effective data (D) size
are determined by the D-bit, possibly overridden by address or data size override prefixes, and
held in the emulation environment register 516. In general, the indirect specifiers of A size, D
size, and S size are substituted or replaced by the absolute encoding for two bytes or four bytes.
For example, if a D size is selected, the data size is resolved into two bytes or four bytes based
on the effective data size as specified by a bit in the emulation environment register 516.
Similarly, if an indirect size specifier encodes the A size, the effective address size is specified
by a bit in the emulation environment register 516 to be either two bytes or four bytes. If the
indirect specifier selects the S size, a B-bit determines whether two bytes or four bytes are
substituted.

The LdStOp operation is checked to determine whether a four-bit segment register field
636 is to be substituted. When the emulation environment is set up, segment override prefixes
are monitored. Segment override prefixes affect the decode of a subsequent instruction when
the generation of a LdStOp is dependent on the effective operand segment of the instruction.
The default segment is DS or SS, depending on the associated general address mode, and is
replaced by the segment specified by the last segment override prefix. The segment register
address space is expanded from conventional x86 specification to four bits to allow additional
special segment registers to be supported. The segment register is encoded as follows:

SegReg#   Seg   Description
0000      ES    architectural segment register
0001      CS    architectural segment register
0010      SS    architectural segment register
0011      DS    architectural segment register
0100      FS    architectural segment register
0101      GS    architectural segment register
0110      HS
0111            (reserved)
100x      TS    descriptor table SegReg (GDT or LDT)
1010      LS    linear SegReg (i.e., null segmentation)
1011      MS    emulation memory segment register
11xx      OS    effective (arch.) data segment register

At Op decode time, the emulation code sequencer 510 replaces the "OS" segment
register with the current three bit segment register number from the emulation environment.
Thus, a segment register is alternatively hardcoded in a specific segment register, or set to
segment OS, which designates the current effective data segment register but which can be
overridden with a segment override.

The emulation memory segment register (MS) designates access of a special emulation
memory for usage in the emulation environment. The descriptor table SegReg TS designates
access of either the global descriptor table (GDT) or the local descriptor table (LDT).

Figure 6C is a load immediate operation (LIMMOp) field encoding graphic that
illustrates various fields in the LIMMOp format. In the LIMMOp field 650, most-significant
bits 36 and 37 are both set to one to designate the operation as a LIMMOp. The LIMMOp
field 650 also includes a 16-bit immediate high part (ImmHi) 652 at bit locations [35:20], a 5-
bit destination (DEST) general register field 654 at bit locations [19:16] and a 16-bit immediate
low part (ImmLo) 656 at bit locations [15:0]. ImmHi 652 and ImmLo 656 combine to specify
a 32-bit value that is loaded into the register specified by the DEST field 654.
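How the two immediate halves combine can be shown with a brief Python sketch. It is a hedged illustration: the function name is hypothetical, and ImmHi is assumed to supply the upper sixteen bits of the result, as its name suggests. The DEST field is deliberately not decoded here because the stated width (5 bits) and bit range ([19:16]) do not agree.

```python
def limm_value(op: int) -> int:
    """Assemble the 32-bit immediate carried by a 38-bit LIMMOp."""
    # Bits 37 and 36 are both one for a LIMMOp.
    if (op >> 36) & 0b11 != 0b11:
        raise ValueError("not a LIMMOp: bits 37 and 36 must both be one")
    imm_hi = (op >> 20) & 0xFFFF   # ImmHi at bit locations [35:20]
    imm_lo = op & 0xFFFF           # ImmLo at bit locations [15:0]
    # Assumption: ImmHi forms the most-significant half of the 32-bit value.
    return (imm_hi << 16) | imm_lo
```

Splitting the constant around the destination field lets a single 38-bit Op carry a full 32-bit value.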

Figure 6D is a special operation (SpecOp) field encoding graphic that illustrates various
fields in the SpecOp format. In the SpecOp field 660, most-significant bits 35 to 37 are
respectively set to 101 to designate the operation as a SpecOp. The SpecOp field 660 also
includes a 4-bit operation type (TYPE) field 662 at bit locations [34:31], a 5-bit condition code
(CC) field 664 at bit locations [30:26] and a two-bit data size (DSz) field 666 at bit locations
[23:22] designating sizes of 1, 4 and DSize bytes. The SpecOp field 660 includes a 5-bit
destination general register (DEST) field 668 at bit locations [21:17] and a 17-bit immediate
constant (Imm17) field 670 at bit locations [16:0]. The Imm17 field 670 holds either a 17-bit
signed value or a 14-bit Op address. The CC field 664 is only used by BRCOND
Ops. The DSz field 666 and DEST field 668 are used only by LDKxx Ops. A standard NOP
Op is defined to be a LIMM t0,<undefined>. The TYPE field 662 encodings of the SpecOp
are as follows:






TYPE code   TYPE     Description
00xx        BRCOND   branch condition
010x        LDDHA    set default handler address
011x        LDAHA    set alternate handler address
100x        LDK      load constant
101x        LDKD     load constant, data
11xx        FAULT    unconditional fault
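Because the SpecOp TYPE codes contain don't-care bits, a decoder matches patterns rather than exact values. The Python sketch below illustrates this; the helper names are hypothetical, and the pattern list simply restates the TYPE codes and mnemonics from the table above.

```python
# TYPE patterns from the SpecOp table; "x" marks a don't-care bit.
SPECOP_TYPES = [
    ("00xx", "BRCOND"),
    ("010x", "LDDHA"),
    ("011x", "LDAHA"),
    ("100x", "LDK"),
    ("101x", "LDKD"),
    ("11xx", "FAULT"),
]


def matches(pattern: str, value: int) -> bool:
    """Return True when value agrees with pattern in every non-don't-care position."""
    value_bits = format(value, f"0{len(pattern)}b")
    return all(p == "x" or p == b for p, b in zip(pattern, value_bits))


def specop_mnemonic(type_code: int) -> str:
    """Map a 4-bit SpecOp TYPE code to its mnemonic."""
    if not 0 <= type_code <= 0b1111:
        raise ValueError("TYPE is a 4-bit field")
    for pattern, mnemonic in SPECOP_TYPES:
        if matches(pattern, type_code):
            return mnemonic
    raise ValueError("no matching TYPE pattern")
```

The six patterns cover all sixteen 4-bit codes without overlap, so the search order does not matter.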



In the case of special operation (SpecOp) substitution, the SpecOp is checked to
determine whether a destination general register (DestReg) is addressed using register
addressing or memory-indexed addressing. If the register addressing form is designated, the
emulation code sequencer 510 substitutes a previously stored modr/m reg field from the Reg
register into the DestReg field 668 of the SpecOp format 660 shown in Figure 6D. If the
memory-indexed addressing form is designated, the emulation code sequencer 510 substitutes a
previously stored modr/m regm field from the Regm register into the DestReg field 668 of the
SpecOp format. The DestReg field 668 holds a five-bit register number which is the destination
of operation LDK or LDKD.

The SpecOp is also checked to determine whether the data size (DSz) field 666 is to be
substituted. A substitution value for the DSz field is determined previous to Op handling by the
emulation code sequencer 510 when the emulation environment is defined in accordance with
current segment default information as overridden by any operand size overrides. The DSz
field 666 is set to a size of 1 byte, 4 bytes or an alternative defined D size for the load
operations LDK and LDKD. The emulation code sequencer 510 substitutes a designated
substitution value into the DSz field 666 of the SpecOp format 660.

If a SpecOp is a load constant, data (LDKD) operation, the emulation code sequencer
510 adjusts or scales data in the 17-bit immediate (Imm17) field 670 based on the current
effective data size, which is specified by the data size (DSz) field 666. The 17-bit immediate
(Imm17) field contains a 17-bit constant, a 17-bit signed immediate or a 14-bit Op address.
Register Address Space and Emulation Interface

The processor 120 implements a register address space that is expanded in comparison to a
conventional x86 register address space. The processor 120 register address space is addressed
by five bits so that 2^5, or 32, registers may be defined. Eight registers correspond to x86
architectural registers. Sixteen additional temporary registers are added. The data size (DSz)
field either directly or indirectly specifies the data size of the operation. For a single-byte
operation, a register designated by the register encoding field is different from a two or four
byte register designated by the same register encoding field. The data size of an operation may
be either one-byte (1B) or two/four bytes (2B/4B) for RegOps, SpecOps and the data field 642 of
LdStOps. The data size is always two/four bytes (2B/4B) for the base 646 and index 649 fields
of the LdStOps.
The processor 120 register field encoding is as follows:















Code    2B/4B    1B     Description
00000   AX/EAX   AL
00001   CX/ECX   CL
00010   DX/EDX   DL
00011   BX/EBX   BL
00100   SP/ESP   AH
00101   BP/EBP   CH
00110   SI/ESI   DH
00111   DI/EDI   BH
01000   t1       t1L    disp32
01001   t2       t2L    imm32
01010   t3       t3L    address (index)
01011   t4       t4L
01100   t5       t1H    address temp
01101   t6       t2H    modr/m effective address
01110   t7       t3H    next decode PC (seq or ret)
01111   t0 (_)   t4H
10000   t8       t8L    intermediate data temp
10001   t9       t9L    intermediate data temp
10010   t10      t10L   intermediate data temp
10011   t11      t11L   intermediate data temp
10100   t12      t8H    intermediate data temp
10101   t13      t9H    intermediate data temp
10110   t14      t10H   intermediate data temp
10111   t15      t11H   data save temp
110xx   reg      reg
111xx   regm     regm











Mnemonics "t0" and "_" are synonyms for a register that can be written but always returns a
zero when read. "_" is typically used in a context where an operand or result value is a "don't
care".
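The size-dependent meaning of a register code can be illustrated for the eight x86 registers with a short Python sketch. The function name is hypothetical and the sketch covers only codes 0 to 7; it relies on the standard x86 register numbering, in which codes 4 to 7 name SP, BP, SI and DI for word or dword operations but AH, CH, DH and BH for byte operations.

```python
# Standard x86 register numbering for codes 0..7.
X86_WIDE = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
X86_BYTE = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]


def register_name(code: int, size_bytes: int) -> str:
    """Name the x86 register selected by a register code for a given operation size."""
    if not 0 <= code < 8:
        raise ValueError("this sketch only covers the eight x86 registers (codes 0-7)")
    if size_bytes == 1:
        return X86_BYTE[code]
    if size_bytes == 2:
        return X86_WIDE[code]
    if size_bytes == 4:
        return "E" + X86_WIDE[code]
    raise ValueError("size_bytes must be 1, 2 or 4")
```

The same code 4 therefore designates AH in a one-byte operation and SP or ESP in a two-byte or four-byte operation.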

Twenty-four general registers are supplied. The first eight registers correspond to the
x86 general registers AX to DI. The remaining sixteen registers are temporary or scratch
registers used within multiple operation sequences implementing CISC instructions. Operations
which use 5-bit register numbers access 32 registers and remaining register numbers that are not
used for integer registers are used as multimedia registers or placeholders for emulation
environment variable substitution.

The x86 integer register set supports addressing for byte operations of either of the
low two bytes of half the registers (AX, CX, DX, and BX). Based on register size
specification, the three-bit register numbers within the x86 instructions are interpreted as either
high/low byte registers or as word/dword registers. From an operational perspective, this size
is specified by either the ASz or DSz field of the operation (ASz for base and index registers
in LdStOps, and generally DSz for data/dest, src1 and src2 registers). The scratch integer
register set supports similar addressing of the lower two bytes of half the registers (t1-t4 and t8-
t11).

The intermediate data temporary registers are used, beginning with t8 in order of value
derivation or little-endian significance. The "reg" and "regm" register numbers are replaced at
Op decode time by the current three bit register number from the emulation environment,
specifically by substituting Reg/RegM with "00xxx". The three bits designate a register from
among the first eight registers (AX through DI).

The organization of the Op formats and the expanded register address space define a
flexible internal microarchitectural state of the processor 120. The x86 architectural state is
merely a subset of the processor 120 microarchitectural state. The expanded register address
space includes an expanded number of temporary registers, as compared to the x86 register
address space. Twenty-four registers are supplied by the expanded register address space of the
processor 120; only eight of these registers correspond to the x86 architectural registers and
sixteen additional temporary registers are supplied. Similarly, the processor 120 includes an
expanded number of flag registers, only some of which correspond to the x86 status flags.

For example, in one processor 120 embodiment, an emulation carry flag is defined in
addition to the conventional carry flag and associated with the emulation environment. Thus, a
conventional add instruction is defined that sets the carry flag and changes the permanent
architectural state of the processor 120. An emulation add operation is defined that sets the
emulation carry flag and not the conventional carry flag. Various instructions are defined to
branch on the basis of the emulation carry flag while other instructions branch on the basis of
the conventional carry flag. Thus, the processor 120 is concurrently operational in both a
conventional microarchitectural state and in an emulation microarchitectural state. This
operation in an emulation environment allows the processor 120 to execute an emulation
microsequence and to not change the visible microarchitectural state.

Referring to Figure 8, the substitution technique is illustrated by a specific example.
An ADD register instruction in the x86 instruction set is encoded as an opcode followed by a
modr/m byte. The 8-bit opcode field of the ADD instruction, in combination with specification
in the modr/m byte that the instruction is a register instruction, is used to derive a particular
ROM entry point that causes the emulation code ROM 232 to produce an Op for an ADD
RegM to Reg operation. The reg and regm fields in the instruction register 512 specify that reg
becomes AX and regm becomes BX, respectively. The reg and regm fields are applied from
the instruction register 512 to the Op substitution circuit 522 so that substitution is achieved.
The emulation code ROM 232 includes Op templates for only a limited number of Ops and the
Op substitution circuit 522 fills the templates variously to generate a large number of different
Ops.

Referring to the listed processor 120 register field encodings, register AX is hard-
specified by a reg code, 00000. Alternatively, four possible reg encodings, 110xx,
specify substitution of a register. The substitution is made based on the particular instruction
emulated and the current state of the processor 120 including the current emulation
environment. Generally, a substitution and execution of an operation is performed by
substituting: (1) an encoding for a Reg into the dest field 620, (2) an encoding for a register
Reg into the src1 field 622, (3) an encoding for a RegM into the Imm8/Src2 field 628, (4) a
zero into the immediate (I) field 626, and (5) either the one, two or four bytes designated by the
data size into the DSz field 618.
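The five substitution steps above can be sketched as a template fill in Python. This is a simplified model, not the Op substitution circuit 522 itself: the dictionary representation of an Op, the field names and both function names are hypothetical. The code 110xx for Reg comes from the text; treating 111xx as the RegM specifier is an assumption.

```python
def substitute_specifier(spec: int, reg: int, regm: int) -> int:
    """Replace an indirect 5-bit specifier by 00xxx; pass direct specifiers through."""
    top = (spec >> 2) & 0b111
    if top == 0b110:            # 110xx: indirect specifier Reg
        return reg & 0b111
    if top == 0b111:            # 111xx: indirect specifier RegM (assumed encoding)
        return regm & 0b111
    return spec


def fill_regop_template(template: dict, reg: int, regm: int, size_bytes: int) -> dict:
    """Fill a RegOp template with the current emulation environment values."""
    op = dict(template)                                               # leave the template intact
    op["dest"] = substitute_specifier(template["dest"], reg, regm)    # step (1)
    op["src1"] = substitute_specifier(template["src1"], reg, regm)    # step (2)
    op["src2"] = substitute_specifier(template["src2"], reg, regm)    # step (3)
    op["i"] = 0                                                       # step (4): register operand
    op["dsz"] = size_bytes                                            # step (5): resolved size
    return op
```

With a template for ADD RegM to Reg, reg set to 0 (AX) and regm set to 3 (BX), the filled Op names AX as destination and first source and BX as second source, matching the Figure 8 example.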

Referring to Figure 9, an exemplary embodiment of a scheduler 260 is shown. The
scheduler 260 includes 24 entries which are associated with up to 24 operations. Each entry
includes a set of storage elements (nominally flip-flops) in a scheduling reservoir 940 and
portions of logic 931 to 936 associated with the entry. The storage elements contain information
regarding an operation (Op) which is awaiting execution, in the process of being executed, or
completed. Most of the storage elements are loaded/initialized when the instruction decoder 220
loads a new operation into scheduler reservoir 940. Some storage elements are loaded or
reloaded later such as when the operation completes execution. Storage fields that retain a
value from the time an operation is loaded into scheduling reservoir 940 until the operation is
retired from scheduler 260 are referred to herein as "static fields." Fields which can be
reloaded with new values are referred to as "dynamic fields."

Individually, a storage element entry may be considered an enabled flip-flop. In
combination, with respect to the scheduler 260, the plurality of storage elements are a shift
register that is four entries wide and six rows (Op quad entries) deep. Storage elements
associated with dynamic fields are loaded from a data source or a preceding storage element.
Data is loaded into a storage element and simultaneously and independently shifted into a
different storage element.

Scheduler 260 controls RU, LU, and SU pipelines created by register units 244 and
246, load unit 240, and store unit 242. Each RU pipeline has three stages referred to as the
issue stage, the operand fetch stage or stage 0, and stage 1. LU and SU pipelines have four stages
referred to as the issue stage, the operand fetch stage or stage 0, stage 1, and stage 2. As
described above, the State field represents the five states using "shifting/increasing field of
ones" encoding to indicate the current pipeline stage of the associated operation or indicate that
the operation has completed its pipeline.

Static fields are defined as follows and all values are active high:

Type[2:0] Specifies the type of operation, particularly for issue selection. For example, LdOps
are selected for issue to LU alone. Type[2:0] decodes into three bits as follows:

000        A special operation not actually executed.
010 = LU   A LdOp executed by load unit.
10x = SU   A StOp executed by store unit.
101 = ST   A StOp which references memory or at least generates a
           faultable address (i.e. not an LEA operation).
11x = RU   A RegOp executed by RUX or possibly RUY.
110 = RUX  A RegOp executable ONLY by RUX.
111 = RUY  A RegOp executable by RUX or RUY.

LD_Imm If a LdStOp, then the operation uses a large displacement (versus a small
(8-bit) displacement held within this entry). If the operation is a RegOp, then
an immediate value is Src2 operand.

Src1Reg[4:0] Source 1 operand register number.

Src2Reg[4:0] Source 2 operand register number.

DestReg[4:0] Destination register number.

SrcStReg[4:0] Source register number for Store data.

Src1BM[1:0], Src2BM[1:0], and Src12BM[2] are byte marks designating bytes of Src1Reg and
Src2Reg that must be valid for execution of the operation. Bits [2], [1], and [0]
correspond to Reg[31:16], Reg[15:8] and Reg[7:0] respectively. Src12BM[2] applies
to both Src1Reg and Src2Reg.

SrcStBM[2:0] Byte marks designating bytes of DestReg, when used by a StOp to specify the
Store Data register, that are necessary for execution and actual completion of the
StOp. The bit correspondence is the same as for Src1BM[1:0].

OpInfo[12:0] Additional information used only by the execution units or the operation commit
unit (OCU) 970 depending on whether the operation is executable. The scheduler does
not otherwise use the information. OpInfo is a union of three possible field definitions,
depending on whether the Op is a RegOp, a LdStOp, or a SpecOp.

For a RegOp:

Type[5:0] Copy of the original Op Type field.

Ext[3:0] Copy of the original Op Ext field.

R1 Copy of the original Op R1 field.

DataSz[1:0] The effective data size for the operation (1, 2, 4 bytes).

For a LdStOp:

Type[3:0] Copy of the original Op Type field.

ISF[1:0] Copy of the original Op ISF field.

Seg[3:0] Copy of the original Op Seg field.

DataSz[1:0] The effective data size for the memory transfer.

AddrSz The effective address size for the address calculation (32/16 bits).

For a SpecOp:

Type[3:0] Copy of the original Op Type field.

CC[4:0] Copy of the original Op cc field.
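The byte mark fields among the static fields can be illustrated with a small Python sketch. The function names are hypothetical; the mapping of bits [2], [1] and [0] to Reg[31:16], Reg[15:8] and Reg[7:0] is taken from the field descriptions above, as is the sharing of Src12BM[2] between both source operands.

```python
def byte_marks_to_mask(bm: int) -> int:
    """Expand 3-bit byte marks into a 32-bit mask of the register bits they cover."""
    mask = 0
    if bm & 0b100:
        mask |= 0xFFFF0000   # bit 2 covers Reg[31:16]
    if bm & 0b010:
        mask |= 0x0000FF00   # bit 1 covers Reg[15:8]
    if bm & 0b001:
        mask |= 0x000000FF   # bit 0 covers Reg[7:0]
    return mask


def source_byte_marks(src_bm_low: int, src12_bm2: int) -> int:
    """Combine a source's own two low marks with the shared Src12BM[2] bit."""
    return ((src12_bm2 & 1) << 2) | (src_bm_low & 0b11)
```

A four-byte operand sets all three marks, a two-byte operand sets the two low marks, and a byte operand on a high byte register such as AH sets only bit 1.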

Dynamic fields are defined as follows with all values active high:

State[3:0] Indicates the current execution state of an operation. (S3,S2,S1,S0 are alternate
signal names for State[3:0].) Five possible states are encoded by a shifting field of
ones across the four bits:

0000 Unissued
0001 Stage 0
0011 Stage 1
0111 Stage 2
1111 Completed

The intermediate states correspond to the Op's current execution stage. The bits are updated
(effectively shifted) as the operation is successfully issued or advances out of a stage. Note:
some operations (e.g. LDK) are loaded into the scheduler with an initial State of 1111 and thus
are already "completed". State[3:0] is also set to 1111 during abort cycles.

if (S0Enbl) S0 = ~BumpEntry + SC_Abort
if (S1Enbl) S1 = (S0 ~BumpEntry) + SC_Abort
if (S2Enbl) S2 = S1 + SC_Abort
if (S3Enbl) S3 = S2 + S1 RU + SC_Abort

BumpEntry = RU ~S1 S0 (Exec1 BumpRUX + ~Exec1 BumpRUY)

S0Enbl = "Issue OPi to LU" CHP_LUAdv0 +
         "Issue OPi to SU" CHP_SUAdv0 +
         "Issue OPi to RUX" CHP_RUXAdv0 +
         "Issue OPi to RUY" CHP_RUYAdv0 +
         SC_Abort + BumpEntry

S1Enbl = LU CHP_LUAdv0 + SU CHP_SUAdv0 +
         RU (Exec1 CHP_RUXAdv0 + ~Exec1 CHP_RUYAdv0) + SC_Abort

S2Enbl = LU CHP_LUAdv1 + SU CHP_SUAdv1 + RU + SC_Abort

S3Enbl = LU CHP_LUAdv2 + SU CHP_SUAdv2 + RU + SC_Abort

Exec1 If the operation is a RegOp, then RUX (not RUY) executes the operation. Exec1 is
loaded when the operation has successfully been issued.

if (S0Enbl) Exec1 = "Issue OPi to RUX"

DestBM[2:0] Byte marks designating bytes of DestReg that the Op modifies. Bits
[2,1,0] correspond to DestReg[31:16], Reg[15:8], and Reg[7:0]
respectively. DestBM is cleared during abort cycles.

if (SC_Abort) DestBM = 3'b0

DestVal[31:0] DestVal is the register result value from execution of the Op to be committed
to DestReg. DestBM indicates which bytes are valid after Op execution.
DestVal is loaded when the Op completes execution stage 1 or 2, depending
on the type of Op. For non-executed Ops such as LDK, DestVal is initialized
with the appropriate register result value. DestVal alternatively holds large
immediate and displacement values for RegOps and LdStOps (respectively),
and the alternate (sequential or target) branch PC value associated with a
BRCOND Op.

if ((~S2 + LU) ~S3 S1) DestVal = switch (Type)
    case LU:            G_DestRes
    case SU:            SU1_DestRes
    case (RU Exec1):    RUX_DestRes
    case (RU ~Exec1):   RUY_DestRes

// an alternative mux control equation
//   if (S2)             G_DestRes
//   elseif (SU)         SU1_DestRes
//   elseif (Exec1)      RUX_DestRes
//   else                RUY_DestRes

StatMod[3:0] Status group marks indicating which groups of status bits the Op modifies. Bits
[3,2,1,0] correspond to {EZF,ECF}, OF, {SF,ZF,AF,PF}, and CF respectively.
StatMod is always all zeroes for non-RegOps and is cleared during abort cycles.

if (Exec1 ~S3 S1 RUX_NoStatMod + SC_Abort)
StatMod = 4'b0

StatVal[7:0] StatVal is the status result value from execution of this Op and is committed to
EFlags. StatMod designates valid bits after Op execution. StatVal is significant only
for RegOps, as is reflected by a zero value of StatMod for all non-RegOps. StatVal is
loaded when the RegOp completes execution stage 1.

if (~S3 S1) StatVal =
(Exec1) ? RUX_StatRes : RUY_StatRes

OprndMatch_XXsrcY where "XX" is LU, SU, RUX, or RUY and "Y" is 1 or 2.

Additional storage elements are used for storing transient information that is passed
between two stages of logic. This differs from information of more global significance.
Storage elements are physically the same as the above and are simply listed for completeness.
Since additional storage elements pass information from the Issue stage to stage 0 of each
processing pipeline or, in one case, from stage 1 to stage 2 of SU, storage element values are
controlled by the XXAdv0 (or SUAdv2) global signals.

if (LUAdv0) {
OprndMatch_LUsrc1 = ...
OprndMatch_LUsrc2 = ... }

if (SUAdv0) {
OprndMatch_SUsrc1 = ...
OprndMatch_SUsrc2 = ... }






if (SUAdv2) {
OprndMatch_SUsrcSt = ... }

if (RUXAdv0) {
OprndMatch_RUXsrc1 = ...
OprndMatch_RUXsrc2 = ... }

if (RUYAdv0) {
OprndMatch_RUYsrc1 = ...
OprndMatch_RUYsrc2 = ... }

DBN[3:0] The four Bn (n=0-3) data breakpoint status bits for a LdStOp. This field is initially
all zeroes; then, when a LdStOp in this position executes, breakpoint bits from the
appropriate unit are recorded for later trapping.

if ((AdvLU2 + AdvSU2) ~S3 S2) DBN[3:0] = (DBN_LU[3:0] LU) +
(DBN_SU[3:0] SU)

Scheduler Op Quad Entry Definition 

The twenty-four entries are managed in FIFO fashion with new operations loaded in
one end (the "top"), shifted down to the other end (the "bottom"), and retired or unloaded from
the bottom of scheduling reservoir 940. To simplify the control, scheduler 260 manages
reservoir 940 on an Op quad basis. Operations are loaded into, shifted through, and retired
from reservoir 940 in groups of four. This matches the fetching and generation of operations in
the form of Op quads by both the emcode ROM and the MacDec in decoder 220.

The scheduling reservoir 940 may be viewed as a six entry shift register containing Op
quads. Each Op quad entry contains four entries, plus additional information associated with
the Op quad as a whole. The following enumerates the additional "Op quad" fields and
provides a short functional description of each field. All values are defined active high. Most
of the fields are static. The description of the dynamic fields includes a pseudo-RTL definition
of the logic generating the data and enable inputs of the storage elements holding these fields.

Emcode: Indicates whether this is an emcode Op quad from emcode ROM or an Op quad
generated by the MacDec.

Eret: Indicates that this is an emcode Op quad and that it "contains" an ERET (return from
emcode ROM).

FaultPC[31:0]: The logical macroinstruction (mI) fault program counter value associated with
the operations in the Op quad entry. Used by the operation commit unit (OCU) 970
when handling fault exceptions on any of the operations.

BPTInfo[14:0] Branch Prediction Table-related information from when the Op quad was
generated. This field is defined only for MacDec-generated Op quads which contain a
BRCOND operation.

RASPtr[2:0] The Return Address Stack top of stack (TOS) value as of when the Op
quad was generated. This field is defined only for MacDec-generated Op quads which
contain a BRCOND operation.

LimViol Indicates that the Op quad is the decode of a transfer control instruction for which a
code segment limit violation was detected on the target address. For most Op quads,
this field is static. For the first register Op quad entry, this field can be loaded as
summarized by the following pseudo-RTL description:

@clk: LdLV = LdEntry0 ~DEC_OpQSel_E
if (LdLV) LimViol = DEC_LimViol

OpQV Indicates whether the Op quad entry contains a valid Op quad and is used by the logic
controlling the shifting of entries. Entries within an "invalid" Op quad entry must
still be defined and must have the same state as aborted entries.

if (SC_Abort) OpQV = 'b0

Op1I, Op2I, Op3I A count of the number of macroinstructions (mI) represented by this Op
quad (1, 2, or 3). Used to maintain a count of retired instructions.

Ilen0, Ilen1 The lengths in bytes of the first and (if present) second macroinstruction
represented by the Op quad. Used to determine the instruction address at which a
fault occurred.

Smc1stAddr, Smc1stPg, Smc2ndAddr, Smc2ndPg
The first and (if there are instructions from more than one page in the Op quad) second
address covered by operations in the Op quad. Used to detect self-modifying code.

The instruction decoder 220 also includes an Op substitution circuit 522 that connects
the macroinstruction decoder 230 and the emulation code ROM 232 of the instruction decoder
220 to the scheduler 260. The Op substitution circuit 522 expands RISC instructions into field
values loaded into the top, Op quad entry 0, of scheduler reservoir 940. To generate an Op quad,
Op substitution circuit 522 modifies some fields from the RISC instruction based on other
fields, derives new fields from existing ones, replaces some fields with physically different
fields, and passes a few fields through unchanged. Environment variable substitution for fields
of emcode operations requiring such, and the nulling of operations within emcode Op quads
which were jumped into the middle of, are both performed in logic preceding OpDec (i.e.
EmDec).

The following pseudo-RTL equations describe the generation of each field of a
scheduler entry from the fields of an incoming RISC instruction. The notation xxxOp.yyy
refers to a field yyy defined for an instruction of type xxxOp, irrespective of the actual type of
the instruction. For example, RegOp.Src1 refers to bits in an instruction at the same position as
the Src1 field of a RegOp. Figures 6A through 6E show exemplary field definitions for a
RegOp, LdStOp, LIMMOp, SpecOp, and FpOp.

Type[2:0] switch(OpId) {
case RegOp: Type[2,1] = 'b11,
Type[0] = ~(RegOp.R1 + RUYD)
case LdStOp: Type[2] = LdStOp.Type[3],
Type[1] = ~LdStOp.Type[3],
Type[0] = LdStOp.Type[3] && ~(LdStOp.Type[2]
&& LdStOp.Type[1])
default: Type[2:0] = 'b000 }

"RUYD" is a special register that disables the second register unit RUY for debugging.

LD_Imm LD_Imm = (OpId=RegOp) ? RegOp.I : LdStOp.LD
//don't care if not RegOp or LdStOp

Src1Reg[4:0]
if (OpId=RegOp) Src1Reg =
RegOp.Src1; Src1Reg[2] &= ~(LdStOp.DSz=1B)
else
Src1Reg = {1'b0,LdStOp.Base}
//don't care if not RegOp or LdStOp

Src2Reg[4:0]
if (OpId=RegOp) Src2Reg =
RegOp.Src2; Src2Reg[2] &= ~(LdStOp.DSz=1B)
else
Src2Reg = {1'b0,LdStOp.Index}
//don't care if not RegOp or LdStOp

DestReg[4:0]
if (OpId=LIMMOp) DestReg = {1'b0,LIMMOp.Dest}
elseif (OpId=LdStOp LdStOp.Type=STUPD)
DestReg = {1'b0,LdStOp.Base}
else {
DestReg = LdStOp.Data
DestReg[2] = DestReg[2] ~(LdStOp.DSz=1B)
} //don't care if non-STUPD StOp

SrcStReg[4:0] SrcStReg = LdStOp.Data
SrcStReg[2] = SrcStReg[2] ~(LdStOp.DSz=1B &&
LdStOp.DataReg=t0)
//don't care if not StOp
Src1BM[1:0], Src2BM[1:0], and Src12BM[2]
if (OpId=RegOp) {
Src1BM[0] = ~(RegOp.DSz=1B) + ~RegOp.Src1[2]
Src1BM[1] = ~(RegOp.DSz=1B) + RegOp.Src1[2]
Src2BM[0] = ~(RegOp.DSz=1B) + ~RegOp.Src2[2]
+ RegOp.I
Src2BM[1] = ~(RegOp.DSz=1B) + RegOp.Src2[2]
~RegOp.I
if (RegOp.Type=10001x) Src2BM[1] =
Src1BM[1] = 1'b0 //if ZEXT,SEXT
Src12BM[2] = (RegOp.DSz=4B)
if (RegOp.Type=(10001x + 111x00)) Src12BM[2] =
1'b0 //if ZEXT,SEXT,CHKS
} else { //else LdStOp or don't care
Src1BM[1:0] = Src2BM[1:0] = 2'b11
Src12BM[2] = (LdStOp.ASz=4B)
} //don't care if LIMM
SrcStBM[2:0]
if (LdStOp.Type=x0xx) { //STxxx Ops
SrcStBM[0] = ~(LdStOp.DSz=1B) +
~LdStOp.Data[2]
SrcStBM[1] = ~(LdStOp.DSz=1B) +
LdStOp.Data[2]
SrcStBM[2] = (LdStOp.DSz=4B)
} else
SrcStBM[2:0] = 'b000 //CDA,CIA,LEA Ops
//don't care if not StOp
OpInfo[12:0] OpInfo[12] = Op[35]
OpInfo[11:8] = (OpId=LIMMOp) ? 'b1111 : Op[34:31]
OpInfo[7:0] = Op[30:25], Op[23:22]
State[3:0] State =
(~OpQV + OpId=SpecOp SpecOp.Type=(LDKxx + LDXHA) +
OpId=LIMMOp) ? 'b1111 : 'b0000
Exec1 Exec1 = X
DestBM[2:0]
if (OpId=LIMMOp) {
if (LIMMOp.DestReg=t0)
DestBM = 'b000
else
DestBM = 'b111
} elseif (OpId=LdStOp LdStOp.Type=STUPD) {
DestBM[1:0] = 2'b11
DestBM[2] = (LdStOp.ASz=4B)
} else {
DestBM[0] = ~(LdStOp.DSz=1B) +
~LdStOp.Data[2]
DestBM[1] = ~(LdStOp.DSz=1B) + LdStOp.Data[2]
DestBM[2] = (LdStOp.DSz=4B)
}
if (~OpQV + DestReg='b01111 +
(OpId=LdStOp LdStOp.Type=ST/STF))
DestBM = 3'b0
DestVal[31:0] DestVal = switch(OpId) {
case RegOp: sext(RegOp.Imm8)
case LdStOp: sext(LdStOp.Disp8)
case LIMMOp: {LIMMOp.ImmHi,LIMMOp.ImmLo}
case SpecOp: if (SpecOp.Type=BRCOND &&
~DEC_OpQSel_E) DEC_AltNextIPC
else sext(SpecOp.Imm17) }

StatMod[3:0]
StatMod = (OpQV OpId=RegOp RegOp.SS) ? RegOp.Ext : 4'b0
StatVal[7:0] StatVal = 8'bX
DBN[3:0] DBN = 4'b0

The following pseudo-RTL code describes generation of each Op quad entry field
which is associated with an Op quad rather than a single particular operation.

Emcode Emcode = DEC_OpQSel_E + DEC_Vec2Emc
//treat vectoring Op quad as part of emcode

Eret Eret = DEC_OpQSel_E EDR_Eret

FaultPC[31:0] FaultPC = DEC_IPC
//the logical PC for the first decoded mI in quad

BPTInfo[14:0] BPTInfo = DEC_BPTInfo
//info from the current BPT access

RASPtr[2:0] RASPtr = DEC_RASPtr
//the current return address stack TOS

OpQV OpQV = ((DEC_OpQSel_E) ? EDR_OpQV : DEC_OpQV)
& ~ExcpAbort ~(SC_MisPred ~BrAbort)

LimViol LimViol = 'b0
//This Op quad bit is actually loaded one cycle
//later than all of the other fields above (i.e. during
//the first cycle that the new Op quad is resident and
//valid within the scheduler). This is reflected in the
//description above of this Op quad field.
ISSUE SELECTION

Each cycle, based on State[3:0] fields of all the entries as of the beginning of the cycle,
the scheduler 260 selects a LdOp, a StOp, and two RegOps to be issued into the corresponding
processing pipelines: LU, SU, RUX, and RUY. The scheduler 260 selects operations for issue
based on State and Type fields within each entry, which are searched "in order" from oldest to
newest operations in scheduling reservoir 940. Issue selection does not consider the register,
status, and/or memory dependencies that each operation may have on relatively older
operations.

The scheduler 260 performs issue selection simultaneously and independently for each
of the four processing pipelines. The selection algorithms for the four pipes are similar. The
next unissued operation (as indicated by its State field) of a given type is selected. In other
words, the next unissued LdOp is selected for load unit 240, the next unissued StOp is selected
for store unit 242, and the next two unissued RegOps are issued to register unit X and register
unit Y (the first to RUX, the second to RUY). Conceptually the issue selection for RUY
depends on RUX, but is performed in parallel with issue selection for RUX. As further
discussed below, some RegOps are only issuable to RUX, and the operation selected for issue
to RUY is the next RegOp that is issuable to RUY.



Each scheduler entry generates four bits (i.e. four signals) "Issuable to xxx" which indicate
whether that operation is currently eligible for issue selection to the pipeline xxx, where xxx is
LU, SU, RUX, or RUY.

"Issuable to xxx" = "State=Unissued" && "Executable by xxx"

Signals that are "issuable to xxx" are generated from State and Type bits of an entry.
Specifically, "State=Unissued" = ~S0 and "Executable by xxx" = LU/SU/RU/RUY for
execution pipeline LU/SU/RUX/RUY respectively. The selection process for pipeline xxx
scans from the oldest scheduler entry to the youngest scheduler entry for entries with bit
"Issuable to xxx" set. For pipelines LU, SU, and RUX, the first operation found with the bit set is
the one selected for issue. Issue selection for pipeline RUY selects the first such operation after
the operation selected for pipeline RUX.
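The selection rule can be sketched in Python as follows. This is an illustrative software model, not the hardware: the `Entry` class, its field names, and the `select` function are hypothetical, and the sketch assumes that no operation is selected for RUY when nothing is selected for RUX.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Entry:
    s0: bool           # State bit S0; False means the operation is unissued
    lu: bool = False   # Type bits: executable by LU, SU, a register unit, RUY
    su: bool = False
    ru: bool = False
    ruy: bool = False

def issuable(e: Entry, pipe: str) -> bool:
    """'Issuable to xxx' = 'State=Unissued' && 'Executable by xxx'."""
    executable = {"LU": e.lu, "SU": e.su, "RUX": e.ru, "RUY": e.ruy}[pipe]
    return (not e.s0) and executable

def select(entries: List[Entry]) -> Dict[str, Optional[int]]:
    """entries[0] is the oldest operation; returns the chosen index per pipeline."""
    picks: Dict[str, Optional[int]] = {}
    for pipe in ("LU", "SU", "RUX"):
        picks[pipe] = next(
            (i for i, e in enumerate(entries) if issuable(e, pipe)), None)
    rux = picks["RUX"]
    # RUY takes the first RUY-issuable entry after the one chosen for RUX.
    picks["RUY"] = None if rux is None else next(
        (i for i in range(rux + 1, len(entries))
         if issuable(entries[i], "RUY")), None)
    return picks
```

Note that an RUX-only RegOp sitting between the RUX pick and the next RUY-capable RegOp is skipped by the RUY search, which matches the statement that the operation selected for RUY is the next RegOp issuable to RUY.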

Operations are eligible for issue selection immediately after being loaded into the
scheduler, i.e. an operation can be issued during its first cycle of residence within the
scheduler. In such cases, only the Type bits and bit S0 need to be valid at the beginning of the
cycle. All other fields in an entry can be generated as late as during the issue selection phase
(i.e. up to one half cycle later) and only need to be valid within a scheduler entry for the next
phase of the processing pipeline.

If an operation selected for issue does not advance into stage 0, then the operation was
not successfully issued. During the next clock cycle, that operation's State field indicates the
operation is unissued, and the operation competes for issue and will probably be selected again.

In one embodiment of the invention, scheduler 260 scans the operations using scan
chains formed from logic circuits in the entries. Each scan chain is similar to a carry chain
such as used in high speed adders. For a scan for operations to be issued to the load unit, the
store unit, or register unit X, a "carry" bit Cin input to the entry for the oldest entry is set and
propagates through the scan chain until logic in one of the entries kills the scan bit. An entry
kills the carry bit if the entry corresponds to an operation of the desired type (i.e. "Issuable to
xxx"). To scan for an operation to be issued to register unit Y, a "carry" bit is generated by an
entry associated with the operation to be issued to register unit X, and that carry bit propagates
until killed by an entry associated with an operation issuable to register unit Y.
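The carry-chain view of the scan can be modeled bit by bit. The sketch below is a behavioral illustration under stated assumptions: the function name `scan` and the per-entry `kill` and `generate` lists are hypothetical, and the serial loop stands in for what would be parallel lookahead logic in hardware.

```python
def scan(kill, generate=None, cin=False):
    """Walk a carry from the oldest entry (index 0) to the youngest.

    kill[i]     -- entry i kills an incoming carry (e.g. "Issuable to xxx")
    generate[i] -- entry i generates a carry for the entries after it
    Returns (selected, cout), where selected[i] is True for the entry
    that received a carry and killed it.
    """
    if generate is None:
        generate = [False] * len(kill)
    selected = []
    carry = cin
    for k, g in zip(kill, generate):
        selected.append(carry and k)      # carry arrives and is killed here
        carry = g or (carry and not k)    # carry leaving this entry
    return selected, carry

# LU, SU and RUX chains: the carry is injected at the oldest entry.
rux_selected, _ = scan(kill=[False, True, False, True], cin=True)

# RUY chain: the carry is generated by the entry selected for RUX.
ruy_selected, _ = scan(kill=[False, True, False, True],
                       generate=rux_selected, cin=False)
```

In this model the entry selected for RUX never selects itself for RUY, because the carry it generates is only visible to younger entries.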

DISPLACEMENT FORWARDING

During the operand forward phase, scheduler 260 fetches and forwards displacement
operands to the LU and SU processing pipelines in addition to the register operands for these
units. The LU and SU each have three input operand buses 952 which carry two register
operands and one displacement operand. Displacement operands are 32-bit quantities, but some
bytes may be undefined. An execution unit does not use undefined displacement bytes during
correct operation.

Scheduler 260 handles displacements in a manner similar to operation register result
values. The displacements are stored until used within the 32-bit DestVal fields of entries and
are driven onto displacement buses 950, as is appropriate, during the operand forward phase of
LU/SU stage 1. Displacements are always immediate values within scheduler 260, so that
forwarding displacement values from register file 264 does not occur. Field DestVal is also
used for result values from LdOps and some StOps of LdStOps, but the two uses do not
conflict since a result value is not loaded into a scheduler entry until after use of the
displacement value, i.e. not until after stage 0.

Small (8-bit) displacements, which are specified within operations, are handled
differently from large (16/32-bit) displacements. Op substitution circuit 522 sign extends a
small displacement before loading the small displacement into the DestVal field of the entry
holding the associated LdStOp. Large displacements are presumed to be stored in the DestVal
field of the entry immediately preceding the LdStOp using the displacement. Generally, the
preceding entry holds a "LIMM t0,[disp]" operation.

The selection of DestVal values to drive onto the displacement buses 950 during each
cycle does not require scanning of scheduler entries. Instead, each entry determines from its
State and Type fields whether to enable its own drivers or drivers in a preceding entry to assert
a DestVal field value onto the appropriate displacement bus 950. The following equations
summarize the enabling of the displacement bus drivers within each entry.

Disp_LU = (thisOp(LU ~S1 S0 ~LD) +
nextOp(LU ~S1 S0 LD)) ? DestVal : 32'bZ
Disp_SU = (thisOp(SU ~S1 S0 ~LD) +
nextOp(SU ~S1 S0 LD)) ? DestVal : 32'bZ

Values "thisOp" and "nextOp" identify the physical entry from which come the
following minterm's literals LU, S1, S0, and LD. Also, in the case of the first/youngest entry
in scheduler 260, the NextOp minterm is zero.
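A small Python model may help make the displacement path concrete. It is a sketch under assumptions: the `Entry` class and function names are hypothetical, the list is ordered oldest to youngest so that "nextOp" is the following element, and the LD bit is taken to mean that the LdStOp uses a large displacement held in the preceding entry. `None` stands in for an undriven (high-impedance) bus.

```python
from dataclasses import dataclass

def sext8(value: int) -> int:
    """Sign extend an 8-bit displacement to a 32-bit pattern."""
    value &= 0xFF
    return (value | 0xFFFFFF00) if value & 0x80 else value

@dataclass
class Entry:
    lu: bool       # Type: operation executes in the load unit
    s1: bool       # State bit S1
    s0: bool       # State bit S0
    ld: bool       # large displacement held in the preceding entry
    dest_val: int  # 32-bit DestVal field

def in_lu_stage0(e: Entry, large: bool) -> bool:
    """Minterm LU ~S1 S0 with LD equal to `large`."""
    return e.lu and (not e.s1) and e.s0 and (e.ld == large)

def disp_lu(entries):
    """Value driven onto the LU displacement bus, or None if undriven."""
    for i, e in enumerate(entries):
        nxt = entries[i + 1] if i + 1 < len(entries) else None
        if in_lu_stage0(e, False) or (nxt is not None and in_lu_stage0(nxt, True)):
            return e.dest_val
    return None
```

For a small displacement, the LdOp's own entry drives the bus with the sign-extended value; for a large displacement, the entry just before the LdOp drives it.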

IMMEDIATE FORWARDING

In accordance with the exemplary format of RISC instructions for execution engine
222, immediate values are only src2 operands of RegOps. Scheduler 260 handles immediate
values in a similar manner to displacements but as part of the register operand forwarding
mechanism. Like displacement values, immediate values are stored in the DestVal fields of
entries; but like register operands, immediate values are forwarded over register operand buses
952 (specifically the RUXsrc2 and RUYsrc2 operand buses).

The exemplary RISC instruction set uses only small (8-bit) immediate values in RegOps
and stores the immediate values in field DestVal of the entry holding the RegOp. Src2 operand
immediate values are forwarded to respective register execution units during the operand
forward phase of stage 0 in place of a register value. The selection of a register value source
(i.e. a scheduler entry or the register file) is inhibited, and the entry in question directly drives
its DestVal field onto the appropriate src2 operand bus 952.

The inhibition of any RUX/RUYsrc2 operand selection is performed during the 
operand selection phase through masking of the generate signal that would normally be asserted 
by the entry holding the RegOp whose operands are being fetched. This is done separately and 
independently for RUXsrc2 and RUYsrc2 and prevents selection of any entry by the 
RUX/Ysrc2 scan chain and selection of the register file as the default operand source. The 
previous operand selection scan chain equations exhibit the inhibition. 






The selection of "immediate" DestVal values to drive onto the RUXsrc2 and RUYsrc2
operand buses during each cycle does not require scanning of scheduler entries. Instead, each
entry enables the drivers of its DestVal field onto the appropriate operand bus simply based on
its State and related bits. These are the same drivers used for normal register value forwarding;
there is simply an additional term in each enable equation for handling immediate operands.
The following summarizes these terms as separate equations enabling separate bus drivers
within each entry.

Oprnd_RUXsrc2 = (RU Exec1 ~S1 S0 Imm) ? DestVal : 32'bZ
Oprnd_RUYsrc2 = (RU ~Exec1 ~S1 S0 Imm) ? DestVal : 32'bZ

When an entry drives an immediate value onto an operand bus 952, the entry also
drives the associated operand status bus 954. The same bus drivers and driver input values as
for normal operands are used for immediate values; there is simply an additional term in each
enable equation (the same additional terms as above). The following summarizes these terms as
separate equations enabling separate bus drivers:

OprndStat_RUXsrc2 = (RU Exec1 ~S1 S0 Imm) ? OprndStat : 10'bZ
OprndStat_RUYsrc2 = (RU ~Exec1 ~S1 S0 Imm) ? OprndStat : 10'bZ

DATA OPERAND FETCHING

StOps have three register source operands and no register destination; in contrast, other
operations have up to two source operands and one destination. The third source operand for
a StOp provides the data to be stored and is sometimes referred to herein as a data operand.
The data operand is not needed to start execution but is needed for completion of a StOp.
Fetching of StOp data operand values is performed in a manner similar to fetching other source
operands, but is synchronized with SU stage 2. Whereas the "normal" operand fetch process
occurs during the issue stage and stage 0 of operation processing, the data operand fetch process
occurs during SU stages 1 and 2. Scheduler 260 detects a data operand that is not yet available
during SU stage 2 and holds the associated StOp in stage 2.

The data operand fetch process is largely the same as the issue and operand fetch stages
described above with two principal differences. The first difference is that an operand selection
phase for the StOp currently in SU stage 1 replaces the issue selection phase and does not scan
across scheduler entries to choose between multiple candidates. Instead, the entry associated
with the StOp at SU stage 1 identifies itself from the State and Type fields. The second
difference is that the OpInfo field of the StOp does not need to be read out (again) during a
broadcast phase for the data operand. Instead, store unit 242 retains the OpInfo value read out
when the StOp was issued and uses that OpInfo field during the following phases. The OpInfo
value read out during the SU Issue stage is passed down with the operation through stages 0, 1,
and 2 of the SU pipeline.

The following equations describe the signals generated for the data operand fetch.
SU stage 1: Operation Selection Phase
"Select for data operand fetch" = SU ~S2 S1
SU stage 1: Data Operand Broadcast Phase
SrcStInfo[7:0] = {SrcStBM[2:0],SrcStReg[4:0]}
OprndInfo_SUsrcSt = "Select for data operand fetch" ?
SrcStInfo : 8'bZ
"match with operand SUsrcSt" = OprndMatch_SUsrcSt
= (busReg[4:0] == DestReg[4:0])
(busBM[1] DestBM[1] + busBM[0] DestBM[0])
where "bus" refers to OprndInfo_SUsrcSt.

The match signal is then latched into a pipeline register bit within each entry:

if (SUAdv2) OprndMatch_SUsrcSt =
"match with operand SUsrcSt"
SU stage 2: Data Operand Selection Phase
Bit-level scan equations:

~P = K = OprndMatch_SUsrcSt
G = SU ~S3 S2

Group-level scan equations are the same as for other operand selection scan chains.
"Supply OPi result value to SUsrcSt" =
SUsrcStchain.CINi SUsrcStchain.Ki

SU stage 2: Operand Forward Phase
Enable for driver within each scheduler entry:

Oprnd_SUsrcSt =
"Supply OPi result value to SUsrcSt" ? DestVal : 32'bZ

OprndStat_SUsrcSt =
"Supply OPi result value to SUsrcSt" ? OprndStat : 10'bZ

Enable for driver at output of register file:

Oprnd_SUsrcSt =
SUsrcStchain.COUT ? SUsrcStRegVal : 32'bZ

OprndStat_SUsrcSt =
SUsrcStchain.COUT ? {7'b1111111,3'bXXX} : 10'bZ

The data operand Oprnd_SUsrcSt transferred over bus 555 is captured in a pipeline
register at the input of store unit 242. During this phase, control logic (not shown) uses the
operand status value read.

REGOP BUMPING 

"Normally", operations issued to a given processing execution unit progress down the
pipeline in fixed order with respect to other operations issued to that pipeline. When an
operation is held up in stage 0, for example, the operation currently being selected for issue to
that pipe also gets held up because operations cannot pass by each other within a processing
pipeline. Scheduler 260 generally manages the processing pipelines based on in-order issue
selection and processing.

One exception is made. When a RegOp is held up in stage 0 of either register unit due
to one or more unavailable operand values, the RegOp may be bumped out of the processing
pipe and back to the unissued state. This amounts to setting the RegOp's State field back to
b0000. When a RegOp is bumped out of RU stage 0, the following RegOp selected for issue
to that register unit advances into stage 0, immediately taking the place of the bumped RegOp.
Simultaneously, the bumped RegOp is immediately eligible for reissue to a register unit, not
necessarily the same register unit.

Bumping is applicable to all RegOps and is subject to the following constraints. First,
an RUX-only RegOp (in RUX stage 0) cannot be bumped if an RUX-only RegOp is currently
being selected for issue to RUX, because bumping would violate the guarantee that RUX-only
RegOps are executed in order with respect to each other. Secondly, a RegOp should only be
bumped if it will be stalled for more than one cycle; otherwise it is generally better to leave the
RegOp in stage 0 waiting to advance to stage 1. The above equations involving the S1 State
bits of entries implement RegOp bumping. In addition, global control logic (not shown)
generates the global bump signals (BumpRUX and BumpRUY) which force assertion of the
RUXAdv0 and RUYAdv0 signals.

STATUS FLAG HANDLING 

The handling and use of status flags, both x86 architectural flags and
micro-architectural flags, involves three areas of functionality: fetching status flag operand
values for cc-dep RegOps, fetching status flag values for resolution of BRCOND operations,
and synchronizing non-abortable RegOps with preceding BRCOND operations. Unlike logic
(not shown) for register operands and LdOp-StOp ordering, logic (not shown) supporting status
flag functions is not spread across all scheduler entries. Status flag handling for related
operations only occurs while operations that access status flags are within certain Op quad
entries. Cc-dep RegOps must be in Op quad entry 3, the middle Op quad entry within
scheduling reservoir 940, during the cycle when status operand fetching occurs (i.e. during
RUX stage 0). BRCOND operations and non-abortable RegOps must be in Op quad entry 4
during resolution by branch unit 252 and RUX stage 0, respectively.

Cc-dep and non-abortable RegOps are held up in RUX stage 0 if they have not yet
shifted down to Op quad entry 3 and 4 respectively. Conversely, scheduler Op quad shifting is
inhibited until such operations are successfully able to advance into RUX stage 1. These
restrictions allow logic 535 to be simpler and smaller. For example, the fetching of appropriate
status flag values for cc-dep RegOps and BRCOND operations only occurs across the bottom
three scheduler Op quad entries and can be performed independently for each of four groups of
status flags corresponding to the four StatMod bits of an operation. Logic can be shared or
utilized for both cc-dep RegOp status operand fetching and BRCOND operation resolution, and
the synchronization between non-abortable RegOps and BRCOND operations is simplified.

In addition, passing status flag values directly from either RegOp execution unit to a
cc-dep RegOp entering the RUX execution unit is not supported. The result is a minimum one
cycle latency between the execution of a ".cc" RegOp and the execution of a following
cc-dependent RegOp. The statistical performance impact of this latency is minimal because
cc-dependent RegOps rarely occur in a consecutive order. Further, any impact can be eliminated
by emcode and MacDec ordering operations to eliminate consecutive cc-dependent RegOps.

To further aid the simplification and reduction of logic, a number of restrictions are
placed on where cc-dep RegOps, BRCOND operations, and non-abortable RegOps can occur
relative to each other within Op quads. These generally translate into emcode coding rules, but
in some cases also constrain MacDec decoding of multiple mI's in one cycle. In particular, the
restrictions are as follows:

1) No ".cc" RegOps after a BRCOND operation.

2) No cc-dep RegOps after a ".cc" RegOp.

3) No non-abortable RegOps in a quad with a BRCOND Op.

4) Only one cc-dep RegOp within an Op quad.

5) Only one BRCOND operation within an Op quad.

6) Only one non-abortable RegOp within an Op quad.
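The six restrictions can be expressed as a small checker, for example for validating emcode Op quads offline. This is an illustrative sketch only: the tag names and the `violations` function are hypothetical, each operation is represented as a set of tags, and "after" is interpreted as a strictly later position within the same Op quad.

```python
def violations(quad):
    """Return the sorted rule numbers (1-6) that an Op quad violates.

    quad -- list of up to four sets of tags drawn from
            {"cc", "ccdep", "brcond", "nonabort"}, oldest operation first.
    """
    found = set()
    seen_brcond = False
    seen_cc = False
    for tags in quad:
        if "cc" in tags and seen_brcond:
            found.add(1)   # ".cc" RegOp after a BRCOND
        if "ccdep" in tags and seen_cc:
            found.add(2)   # cc-dep RegOp after a ".cc" RegOp
        seen_brcond = seen_brcond or ("brcond" in tags)
        seen_cc = seen_cc or ("cc" in tags)

    def count(tag):
        return sum(1 for tags in quad if tag in tags)

    if count("brcond") and count("nonabort"):
        found.add(3)       # non-abortable RegOp in a quad with a BRCOND
    if count("ccdep") > 1:
        found.add(4)
    if count("brcond") > 1:
        found.add(5)
    if count("nonabort") > 1:
        found.add(6)
    return sorted(found)
```

A quad such as cc-dep RegOp, ".cc" RegOp, BRCOND, no-op passes all six checks, whereas placing the ".cc" RegOp before the cc-dep RegOp triggers rule 2.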
STATUS FORWARDING TO CC-DEP REGOPS 

During each cycle, the four operations within Op quad entry 3 are checked to
determine whether any of them is a cc-dep RegOp. If one is, then the specific type of the
RegOp is decoded to determine which groups of status flags are needed, and the StatusV bits
are checked to determine whether all of those groups are, in fact, valid. Concurrently,
Status[7:0] is blindly passed to the RUX execution unit.

If all of the required flag groups are currently valid, then the RegOp is allowed to 
advance into RUX stage 1 at least insofar as the status operand fetch is concerned. If the 
RegOp does not immediately advance into stage 1, though, then shifting of Op quad entries 3 to 
6 is inhibited. If any of the required flag groups are not currently valid, then the RegOp is held 
up from advancing into RUX stage 1 and shifting of scheduler Op quad entries 3 to 6 is 
inhibited. 
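The decision described in the two preceding paragraphs reduces to a mask comparison. The following Python sketch is a simplified illustration, not the patent's logic equations: the function name and arguments are hypothetical, and both `required` and `status_v` are assumed to be 4-bit masks with one bit per status flag group.

```python
def status_fetch(required: int, status_v: int, advances: bool):
    """Model the status operand fetch for a cc-dep RegOp in Op quad entry 3.

    required -- mask of status flag groups the RegOp needs
    status_v -- mask of status flag groups that are currently valid
    advances -- True if nothing else stops the RegOp entering RUX stage 1
    Returns (may_advance, inhibit_shift).
    """
    all_valid = (required & ~status_v & 0xF) == 0
    may_advance = all_valid
    # Shifting of Op quad entries 3 to 6 is inhibited unless the RegOp
    # both has valid flags and actually advances in this cycle.
    inhibit_shift = not (all_valid and advances)
    return may_advance, inhibit_shift
```

For example, a RegOp needing only one group that is already valid may advance, but if it is held up for another reason the shift of entries 3 to 6 is still inhibited.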

If there is no unexecuted cc-dep RegOp in Op quad entries 3 to 6, but there is a cc-dep
RegOp in RUX stage 0, then the RegOp is unconditionally held up in stage 0. If a cc-dep
RegOp in Op quad entry 3 has not yet executed, but there is no cc-dep RegOp in RUX stage 0
or there is an unexecuted cc-dep RegOp in Op quad entries 4 to 6, then shifting of Op quad
entries 3 to 6 is inhibited.

There is an additional input from the RUX unit (RUX_NoStatMod) which indicates that
the operation being executed there does not modify status flags. A cycle-delayed version, called
"NoStatMod," is useful in several situations. The following equations describe this logic.

CCDepInRUX_0 = (OpInfo_RUX_0(RegOp).Type[3:2] = 'b01) OpV_RUX_0
UnexecCCDepInQ3 =
OP12:(RU OpInfo(RegOp).Type[3:2]='b01 ~S1) +
OP13:(RU OpInfo(RegOp).Type[3:2]='b01 ~S1) +
OP14:(RU OpInfo(RegOp).Type[3:2]='b01 ~S1) +
OP15:(RU OpInfo(RegOp).Type[3:2]='b01 ~S1)
if (~OpInfo_RUX_0(RegOp).Type[5])
StatV = StatusV[1]
//need CF for ADC,SBB,RLC,RRC Ops
elseif (OpInfo_RUX_0(RegOp).Type[1:0] = 'b10)
StatV = StatusV[0] //need EZF,ECF for MOVcc Op
else //need OF through CF for MOVcc,RDFLG,DAA,DAS Ops
StatV = StatusV[3] StatusV[2] StatusV[1]
//keep track of when en unexecuted ee-dep RegOp i» in //Op quad entry 3: 
SolExrcCCDcp - CCDepInRUXj0SC_AiMlUXO -BwnpRUX 
//keep track of whan an unexecuted cc-dep RegOp ia in //Op quad entry 4 
20 Octk:tf(UiEntry4 ^SmExccCC^^SC^EAbori) 

UnaxacCCOeplnQ4 - LdEnny4 UoexecOCDcpInQ3 
-SmBxecCCDcp -SC_BAbort 
//hold copy of etatua flat values at input to RUX //exeeutknuaii: 
SCJfoldStafaia - UnexecCCE)cpInQ4 
25 //bold RcfOp execution if 

SttluiInvUJlUX -(CCDeplnRUXO - Uoex«CXXtepInQ4) 

~(UoeX£cCCDcpInQ3StttV -NoStalMod) 
//bold Op quad from afaiftiqf out of arhartiilrr quad 3 //if 
HoldOpQ3 - 

30 UnexecCCDepInQ3 ~(CCDcplnRUX_0'StatV -NoStalMod) + 

UnexecCCDcpbQt 
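The two hold conditions for cc-dependent RegOps can be illustrated with a minimal Python sketch. It assumes the reading of the equations in which juxtaposition is AND, "+" is OR and "~" is NOT; the function name and the lower-case parameter names are hypothetical stand-ins for the signals CCDepInRUX_0, UnexecCCDepInQ3, UnexecCCDepInQ4, StatV and NoStatMod, and are not part of the specification.

```python
def cc_dep_holds(cc_dep_in_rux0, unexec_in_q3, unexec_in_q4, stat_v, no_stat_mod):
    """Return (StatusInvld_RUX, HoldOpQ3) for one cycle.

    Assumed reading of the printed equations:
      StatusInvld_RUX = CCDepInRUX_0 ~UnexecCCDepInQ4 ~(UnexecCCDepInQ3 StatV ~NoStatMod)
      HoldOpQ3 = UnexecCCDepInQ3 ~(CCDepInRUX_0 StatV ~NoStatMod) + UnexecCCDepInQ4
    """
    # Hold the RegOp in RUX stage 0 unless it is the Op from quad 3 and
    # the release term (quad 3 pending, flags valid, ~NoStatMod) is true.
    status_invld_rux = (
        cc_dep_in_rux0
        and not unexec_in_q4
        and not (unexec_in_q3 and stat_v and not no_stat_mod)
    )
    # Hold Op quad 3 while its cc-dep RegOp has not started, or while an
    # older unexecuted cc-dep RegOp is still further down the scheduler.
    hold_op_q3 = (
        unexec_in_q3 and not (cc_dep_in_rux0 and stat_v and not no_stat_mod)
    ) or unexec_in_q4
    return status_invld_rux, hold_op_q3
```

With a cc-dep RegOp in RUX stage 0 and none pending in the quads, the sketch reports the unconditional stage-0 hold described in the text; with a pending RegOp in quad 3 and nothing in stage 0, it reports that shifting is inhibited.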



BRCOND OPERATION RESOLUTION

During each cycle, Op quad entry 4 is checked for a BRCOND operation. If one is
found, then the condition code (cc) field of that entry is decoded to select one of 32 condition
value combinations and associated valid bits. The value and validity of the selected condition
are then used to inhibit scheduler shifting of quad entries 4 to 6 and/or to assert pipeline restart
signals if necessary.

If a BRCOND operation is found to be mispredicted (and thus a pipeline restart is
required), the restart signal is asserted based on whether the BRCOND operation is a
"MacDec" or emcode operation and whether it is from internal or external emcode. In
addition, an appropriate x86 macroinstruction or emcode vector address and a return
address stack TOS value are generated.

For the benefit of the logic handling synchronization between non-abortable RegOps
and preceding BRCOND operations (described in the next section), a record is also maintained of
the occurrence of a mispredicted BRCOND operation while it remains outstanding (i.e. until an
abort cycle occurs). Further, the existence of an outstanding mispredicted BRCOND operation
is used to hold up the loading of "new" Op quads (from the "restarted" MacDec) until the abort
cycle occurs.

If a BRCOND operation was correctly predicted, the only action taken is to set the 
BRCOND operation's State bit S3 to indicate the BRCOND is completed. 

The following equations describe all of this logic. Reference is made below to "DTF"
and "SSTF" signals - these are signals indicating breakpoint and single-step traps, respectively.
There is also a signal called "MDD," for "multiple decode disable," which can be used for
debugging to prevent more than one macroinstruction from being inserted into the scheduler at a
time.

BRCOND16 =
    OP16:(Type=SpecOp OpInfo(SpecOp).Type=BRCOND ~S3)
BRCOND17 =
    OP17:(Type=SpecOp OpInfo(SpecOp).Type=BRCOND ~S3)
BRCOND18 =
    OP18:(Type=SpecOp OpInfo(SpecOp).Type=BRCOND ~S3)
BRCOND19 =
    OP19:(Type=SpecOp OpInfo(SpecOp).Type=BRCOND ~S3)
BRCONDInQ4 =
    (BRCOND16 + BRCOND17 + BRCOND18 + BRCOND19) OPQ4:OpQV
CondCode[4:0] =
    {BRCOND16}*5 OP16:OpInfo(SpecOp).CC[4:0] +
    {BRCOND17}*5 OP17:OpInfo(SpecOp).CC[4:0] +
    {BRCOND18}*5 OP18:OpInfo(SpecOp).CC[4:0] +
    {BRCOND19}*5 OP19:OpInfo(SpecOp).CC[4:0]
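The CondCode selection above has the shape of an AND-OR multiplexer over the four entries of Op quad 4. The Python sketch below illustrates that shape under one assumption that the text does not state explicitly: that a term of the form {x}*5 replicates the single bit x five times to form a mask. The function name and argument layout are hypothetical.

```python
def select_cond_code(brcond_flags, cc_fields):
    """AND-OR mux over the four Op entries of quad 4 (OP16..OP19).

    brcond_flags: four booleans, the per-entry BRCOND16..BRCOND19 signals.
    cc_fields:    four 5-bit integers, the per-entry CC[4:0] fields.
    """
    result = 0
    for flag, cc in zip(brcond_flags, cc_fields):
        mask = 0b11111 if flag else 0b00000  # assumed meaning of {flag}*5
        result |= mask & (cc & 0b11111)      # AND with the field, OR the terms
    return result
```

When at most one flag is set, the result is simply the CC field of the entry that holds the unresolved BRCOND operation; when none is set, the result is zero.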






CondV = switch (CondCode[4:1]) {
    case 0000: 'b1
    case 0001: StatusV[0]
    case 0010: StatusV[0]
    case 0011: StatusV[0] StatusV[2]
    case 0100: StatusV[0]
    case 0101: StatusV[0]
    case 0110: StatusV[0]
    case 0111: StatusV[0] StatusV[2]
    case 1000: StatusV[3]
    case 1001: StatusV[1]
    case 1010: StatusV[2]
    case 1011: StatusV[2] StatusV[1]
    case 1100: StatusV[2]
    case 1101: StatusV[2]
    case 1110: StatusV[3] StatusV[2]
    case 1111: StatusV[3] StatusV[2] }
//any active h/w interrupt requests?
IP = SI_NMIP + SI_INTRP
CondVal = switch (CondCode[4:1]) {
    case 0000: CondCode[0] ^ 'b1
    case 0001: CondCode[0] ^ Status[0]
    case 0010: CondCode[0] ^ Status[1]
    case 0011: Status[1] + (CondCode[0] ^ ~Status[5])
    case 0100: CondCode[0] ^
        (~Status[1] ~IP ~(DTF+SSTF+MDD))
    case 0101: CondCode[0] ^
        (~Status[1] ~IP ~(DTF+SSTF+MDD))
    case 0110: CondCode[0] ^
        (~Status[0] ~IP ~(DTF+SSTF+MDD))
    case 0111: ~Status[1] ~IP ~(DTF+SSTF+MDD)
        (CondCode[0] ^ Status[5])
    case 1000: CondCode[0] ^ Status[7]
    case 1001: CondCode[0] ^ Status[2]
    case 1010: CondCode[0] ^ Status[5]
    case 1011: CondCode[0] ^ (Status[5] + Status[2])
    case 1100: CondCode[0] ^ Status[6]
    case 1101: CondCode[0] ^ Status[3]
    case 1110: CondCode[0] ^ (Status[7] ^ Status[6])
    case 1111: CondCode[0] ^
        ((Status[7] ^ Status[6]) + Status[5]) }

// the definition of CondCode[4:1] is as follows
// (bit 0 flips the sense):
    True     4'b0000
    ECF      4'b0001
    EZF      4'b0010
    SZnZF    4'b0011
    MSTRZ    4'b0100
    STRZ     4'b0101
    MSTRC    4'b0110
    STRZnZF  4'b0111
    OF       4'b1000
    CF       4'b1001
    ZF       4'b1010
    CvZF     4'b1011
    SF       4'b1100
    PF       4'b1101
    SxOF     4'b1110
    SxOvZF   4'b1111
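As an illustration of how bit 0 of the condition code flips the sense of the selected condition, the Python sketch below evaluates the True case and the eight x86-flag cases of the CondCode[4:1] table. The Status bit positions used here (OF = 7, SF = 6, ZF = 5, PF = 3, CF = 2) are an assumption inferred from the condition names and are not stated in the text; the emcode-specific cases (ECF, EZF, SZnZF, MSTRZ, STRZ, MSTRC, STRZnZF) are deliberately left out, and the function name is hypothetical.

```python
# Assumed positions of the x86 flags within the 8-bit Status value.
OF, SF, ZF, PF, CF = 7, 6, 5, 3, 2

def cond_val(cond_code, status):
    """Evaluate a 5-bit condition code against an 8-bit Status value."""
    sense = cond_code & 1           # bit 0 flips the sense of the condition
    select = (cond_code >> 1) & 0xF # CondCode[4:1] selects the condition

    def bit(i):
        return (status >> i) & 1

    table = {
        0b0000: 1,                              # True
        0b1000: bit(OF),                        # OF
        0b1001: bit(CF),                        # CF
        0b1010: bit(ZF),                        # ZF
        0b1011: bit(CF) | bit(ZF),              # CvZF
        0b1100: bit(SF),                        # SF
        0b1101: bit(PF),                        # PF
        0b1110: bit(SF) ^ bit(OF),              # SxOF
        0b1111: (bit(SF) ^ bit(OF)) | bit(ZF),  # SxOvZF
    }
    if select not in table:
        raise ValueError("emcode-specific condition not modelled in this sketch")
    return sense ^ table[select]
```

For example, with only ZF set in Status, the ZF condition with bit 0 clear evaluates to 1, and the same condition with bit 0 set evaluates to 0.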

//hold Op quad from shifting out of Op quad entry 4 if
HoldOpQ4A = BRCONDInQ4 ~CondV
SC_Resolve =
    BRCONDInQ4 CondV ~SC_MisPred ~NoStatMod ~OPQ4:Emcode
//remember resolution of a BRCOND operation in quad 4:
@clk: Resolved = ~LdEntry4 (SC_Resolve + Resolved)
//terminate REP MOVS emcode loop if almost done?
//use CS "D" bit supplied by RUX to aid termination in 16-bit case
TermMOVS = BRCONDInQ4 CondV ~NoStatMod ~SC_MisPred
    ((CondCode[4:1] = 'b0110) (OP19:DestVal[15:0] = 16'b5)
        (OP19:DestVal[31:16] = 16'b0 + RUX_D) +
    (CondCode[4:1] = 'b0100) (OP23:DestVal[15:0] = 16'b[illegible])
        (OP23:DestVal[31:16] = 16'b0 + RUX_D))
//CondCode=MSTRC ... + CondCode=MSTRZ ...
@clk: TermedMOVS = ~LdEntry4 (TermMOVS + TermedMOVS)
SC_TermMOVS = TermMOVS + TermedMOVS
//get emcode or instruction vector address for handling mispredicted branch
BrVecAddr[31:0] =
    {BRCOND16}*32 OP16:DestVal[31:0] +
    {BRCOND17}*32 OP17:DestVal[31:0] +
    {BRCOND18}*32 OP18:DestVal[31:0] +
    {BRCOND19}*32 OP19:DestVal[31:0]
//supply old RAS TOS ptr to MacDec for restoring if BRCOND operation mispredicted:
SC_OldRASPtr[2:0] = OPQ4:RASPtr[2:0]
//supply old BPT info to MacDec for restoring if BRCOND operation mispredicted:
SC_OldBPTInfo[14:0] = OPQ4:BPTInfo[14:0]
//supply either fault PC or alternate branch address to MacDec if BRCOND operation mispredicted:
SC_RestartAddr[31:0] = ExcpAbort ? OPQ5:FaultPC :
    ((OPQ4:Emcode) ? OPQ4:FaultPC[31:0] :
        BrVecAddr[31:0])
//initiate restart if BRCOND operation mispredicted:
BrVec2Emc = SC_Resolve ~CondVal OPQ4:Emcode
BrVec2Dec = SC_Resolve ~CondVal OPQ4:~Emcode
//remember misprediction:
@clk: if (SC_Resolve + SC_Abort)
    SC_MisPred = ~SC_Abort (~CondVal + SC_MisPred)
//mark BRCOND Op as Completed if correctly predicted:
@clk: if (SC_Resolve CondVal BRCOND16) OP16:S3 = 'b1
@clk: if (SC_Resolve CondVal BRCOND17) OP17:S3 = 'b1
@clk: if (SC_Resolve CondVal BRCOND18) OP18:S3 = 'b1
@clk: if (SC_Resolve CondVal BRCOND19) OP19:S3 = 'b1
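Two pieces of the logic above lend themselves to a short illustration: the choice between the emcode and MacDec restart signals, and the record of an outstanding misprediction that is kept until the abort cycle. The Python sketch below models both as pure functions over booleans. It assumes the usual reading of the notation (juxtaposition is AND, "+" is OR, "~" is NOT, and "@clk: if (...)" is a clock-enabled register update); the function names are hypothetical.

```python
def restart_vectors(sc_resolve, cond_val, q4_is_emcode):
    """Return (BrVec2Emc, BrVec2Dec) for the current cycle."""
    # A restart is requested only for a resolved BRCOND whose condition
    # value is false, i.e. one that was mispredicted.
    mispredicted = sc_resolve and not cond_val
    return (mispredicted and q4_is_emcode,
            mispredicted and not q4_is_emcode)


def next_mispred(sc_mispred, sc_resolve, sc_abort, cond_val):
    """Return the value of SC_MisPred after the next clock edge."""
    if sc_resolve or sc_abort:
        # An abort clears the record; otherwise a false condition value
        # sets it and an already-set record is kept.
        return (not sc_abort) and ((not cond_val) or sc_mispred)
    return sc_mispred  # register holds its value when not enabled
```

A mispredicted BRCOND from emcode asserts only the first signal, one from MacDec only the second, and the misprediction record then persists across cycles until an abort occurs.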



A BRCOND operation being successfully resolved may sit in Op quad entry 4 for more 
than one cycle due to Op quad entries 5 and 6 not being able to shift and thus preventing Op 
quad entry 4 from shifting down. During this time SC_Resolve = 1 and one of signals
BrVec2XXX on buses 557 remains asserted for the entire time (versus for just the first cycle).
This is all right since instruction decoder 220 keeps restarting each cycle until signal BrVec2XXX
deasserts. All of the other associated signals, such as the vector address, maintain proper values
throughout this time.

NON-ABORTABLE RegOp SYNCHRONIZATION

During each cycle, Op quad entry 4 is checked for a non-abortable RegOp. If one is 
found, then scheduler 260 checks for any preceding mispredicted BRCOND operations. Due to 
emcode coding constraints, any preceding BRCOND operations must be in a lower Op quad 
entry and thus must have all been resolved. Further, any BRCOND operation currently being
resolved is guaranteed to fall after the non-abortable RegOp and thus is irrelevant.

If no such mispredicted BRCOND operations exist, then the RegOp is allowed to 
advance into RUX stage 1. If the RegOp does not immediately advance into stage 1, it is still
allowed to shift out of Op quad entry 4.

If Op quad entries 4 or 5 contain no unexecuted non-abortable RegOp but there is a 
non-abortable RegOp in RUX stage 0, the non-abortable RegOp is unconditionally held up in
stage 0 until the non-abortable RegOp reaches Op quad entry 4. If there is a non-abortable RegOp
in Op quad entry 4 that has not yet executed, but there is no non-abortable RegOp in RUX stage 
0 or there is an unexecuted non-abortable RegOp in Op quad entry 5, shifting of Op quad 
entries 4 and 5 is inhibited. 

The following equations describe this logic: 

NonAbInRUX_0 = (OpInfo_RUX_0(RegOp).Type[5:2] =
    'b1110) OpV_RUX_0
UnexecNonAbInQ4 =
    OP16:(RU OpInfo(RegOp).Type[5:2]='b1110 ~S1) +
    OP17:(RU OpInfo(RegOp).Type[5:2]='b1110 ~S1) +
    OP18:(RU OpInfo(RegOp).Type[5:2]='b1110 ~S1) +
    OP19:(RU OpInfo(RegOp).Type[5:2]='b1110 ~S1)
//hold RegOp execution if
NonAbSync = NonAbInRUX_0
    (~UnexecNonAbInQ4 + SC_MisPred + "trap pending")
//hold Op quad from shifting out of Op quad entry 4 if
HoldOpQ4B = UnexecNonAbInQ4
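The synchronization rule for non-abortable RegOps reduces to a small boolean function. The Python sketch below is one reading of the equations above, again taking juxtaposition as AND and "+" as OR; the function and parameter names are hypothetical, and "trap pending" is modelled as a plain boolean input.

```python
def non_ab_sync(non_ab_in_rux0, unexec_non_ab_in_q4, sc_mispred, trap_pending):
    """Return True when the non-abortable RegOp in RUX stage 0 must be held."""
    # Hold if the RegOp has not yet reached Op quad entry 4, or if a
    # mispredicted BRCOND is outstanding, or if a trap is pending.
    return non_ab_in_rux0 and (
        (not unexec_non_ab_in_q4) or sc_mispred or trap_pending
    )


def hold_op_q4b(unexec_non_ab_in_q4):
    """Return True when Op quad entry 4 must not shift (HoldOpQ4B)."""
    return unexec_non_ab_in_q4
```

A non-abortable RegOp that has reached quad 4 with no outstanding misprediction and no pending trap is released; in every other case with a non-abortable RegOp in stage 0, the sketch reports a hold.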
Superscalar processor 120 may be incorporated into a wide variety of system
configurations, illustratively into standalone and networked personal computer systems,
workstation systems, multimedia systems, network server systems, multiprocessor systems,
embedded systems, integrated telephony systems, video conferencing systems, etc. Figure 10
depicts an illustrative set of suitable system configurations for a processor, such as superscalar
processor 120, that has an instruction decoder which implements a RISC86 instruction set. In
particular, Figure 10 depicts a suitable combination of a superscalar processor having an
instruction decoder that implements a RISC86 instruction set with suitable bus configurations,
memory hierarchies and cache configurations, I/O interfaces, controllers, devices, and
peripheral components.

While the invention has been described with reference to various
embodiments, it will be understood that these embodiments are illustrative and that the scope of
the invention is not limited to them. Many variations, modifications, additions, and
improvements of the embodiments described are possible. Additionally, structures and
functionality presented as hardware in the exemplary embodiment may be implemented as
software, firmware, or microcode in alternative embodiments. For example, the description
depicts a macroinstruction decoder having short decode pathways including three rotators 430,
432 and 434, three instruction registers 450, 452 and 454 and three short decoders SDec0 410,
SDec1 412 and SDec2 414. In other embodiments, different numbers of short decoder
pathways are employed. A decoder that employs two decoding pathways is highly suitable.
These and other variations, modifications, additions, and improvements may fall within the
scope of the invention as defined in the claims which follow.
WE CLAIM:

1. In a RISC-like processor core for executing a plurality of RISC-like operations in
parallel, the RISC-like operations being operations of a RISC-like instruction set converted from
a set of CISC-like instructions, the RISC-like instruction set comprising:

a plurality of mutually-uniform bit-length instruction codes, each code being divided
into a plurality of defined-usage bit fields and the codes being classified into a
plurality of instruction classes, the code in an instruction class having a
mutually-consistent definition of defined-usage bit-fields including a bit-field
that is mapped using indirect specifiers so that a single mutually-uniform
bit-length instruction code maps into a plurality of instruction versions.

2. An instruction set according to Claim 1, wherein the instruction classes include:

a register operation (RegOp) class including arithmetic operations, shift operations and
move operations and having defined-usage bit-fields including an operation
type field, three operand bit-fields for designating a first source operand, a
second source operand and a destination operand, a bit-field for designating a
data size of the operands.

3. An instruction set according to Claim 2, wherein the RegOp class further includes
an operation that modifies one or more status flags, the RegOp class further having defined- 
usage bit-fields including: 

an extension bit-field for specifying the one or more status flags that are modified by 
the operation; and 

a set status bit-field for causing the operation to modify status flags in accordance with 
the extension bit-field. 

4. An instruction set according to Claim 2, wherein the RegOp class further includes 
special register read and write operations and has defined-usage bit-fields including: 

an extension bit-field for specifying a condition code for a conditional move instruction 
and for specifying a special register to be read and written by the special
register read and write operations. 

5. An instruction set according to Claim 2, wherein the RegOp class further has 
defined-usage bit-fields including: 

a bit-field for designating an execution unit for executing an operation.

6. An instruction set according to Claim 2, wherein the RegOp class of operations 
includes an operation for writing a program counter to redirect processor execution. 
7. An instruction set according to Claim 2, wherein the RegOp class of operations
includes a check selector operation for simultaneously checking a memory segment selector for
access permission and loading the selector if permission is attained. 

8. An instruction set according to Claim 2, wherein the RegOp class of operations
includes a check selector operation (CHKS) for simultaneously checking a memory segment
selector for access permission and loading the selector if permission is attained.

9. An instruction set according to Claim 2, wherein the RegOp class of operations
includes:

a write descriptor real (WRDR) operation for loading a segment register in a real mode
fashion;

a write descriptor protected mode high (WRDH) operation and a write descriptor
protected mode low (WRDL) operation for performing a sequence of
checking operations for checking data segments and I/O address space
as an emulation code sequence of hardware primitives.

10. An instruction set according to Claim 1, wherein the indirect specifiers specify an
instruction parameter selected from the parameters including an operating register, a data size
and an address size.

11. An instruction set according to Claim 1, wherein the instruction classes include:

a load-store operation (LdStOp) class including load and store operations and having
defined-usage bit-fields including an operation type field, a plurality
of bit-fields for designating a load-store address in memory, a bit-field
for designating a data source-destination register for sourcing-receiving
data from the load-store address in memory, and a bit-field
for designating a data size of the source-destination data.

12. An instruction set according to Claim 11, wherein the plurality of bit-fields for
designating a load-store address in memory in the LdStOp class includes:

a segment register for designating a memory segment of the load-store address;

a base register for designating a base of the load-store memory address;

an index register for designating a memory index; and

an index scale factor bit-field for designating an index scale factor.

13. An instruction set according to Claim 12, wherein the LdStOp class of operations 
includes a check data effective address (CDA) operation. 
14. An instruction set according to Claim 12, wherein the LdStOp class of operations
includes a check instruction effective address (CIA) operation.

15. An instruction set according to Claim 12, wherein the LdStOp class of operations
includes a store integer data with base register update (STUPD) operation.

16. An instruction set according to Claim 12, wherein the LdStOp class of operations
includes an operation for invalidating a TLB address (TIA). 

17. An instruction set according to Claim 1, wherein the instruction classes include:

a load operation (LdOp) class including load operations and having defined-usage
bit-fields including an operation type field, a plurality of bit-fields for
designating a source address in memory, a bit-field for designating a
destination register for receiving data from the source address in
memory, and a bit-field for designating a data size of the
source-destination data.

18. An instruction set according to Claim 1, wherein the instruction classes include:

a store operation (StOp) class including store operations having defined-usage bit-fields
including an operation type field, a plurality of bit-fields for
designating a store address in memory, a bit-field for designating a
data source register for sourcing data from the store register, and a
bit-field for designating a data size of the source-destination data.

19. An instruction set according to Claim 1, wherein the instruction classes further
include a load immediate operation class (LIMMOp) having defined-usage bit-fields including:

an immediate data high (ImmHi) bit-field and an immediate data low (ImmLo) bit-field
for designating the immediate data value; and

a bit-field for designating a data destination register for receiving data from the
load-store address in memory.

20. An instruction set according to Claim 1, wherein the instruction classes further
include a special operation class (SpecOp) including a conditional branch operation, a set default
fault handler address operation, a set alternate fault handler address operation and an
unconditional fault operation and having defined-usage bit-fields including:

a bit-field for designating a condition code; and 

a data immediate bit-field for designating a signed immediate data value. 
21. An instruction set according to Claim 20, wherein the special operation class
(SpecOp) further includes a load constant operation and the defined-usage bit-fields further
include:

a bit-field for designating a data destination register; and

a bit-field for designating a data size of constant data.

22. An instruction set according to Claim 20, wherein the SpecOp class of operations
includes an operation for loading a constant (LDK).

23. An instruction set according to Claim 20, wherein the SpecOp class of operations 
includes a load constant data (LDKD) operation. 

24. An instruction set according to Claim 1, wherein a defined-usage bit-field of an
operation is designated by a substituted value that is determined by an emulation environment of 
the processor. 

25. A RISC-like internal instruction set for executing on a RISC-like core of a 
superscalar processor, the RISC-like internal instruction set being translated from instructions of
a CISC-like external instruction set, the instruction set comprising: 

a plurality of instruction codes arranged in a fixed bit-length structure, the structure 

being divided into a plurality of defined-usage bit fields, the instruction codes 
having a plurality of operation (Op) formats of the plurality of defined-usage 
bit fields, the formats including: 

a first register Op format having a format bit-field designating the instruction code as a
first register Op format code, a type bit-field designating an instruction type, a
first source operand bit-field identifying a first source operand, a second
source operand bit-field identifying a second source operand, a destination
bit-field designating a destination operand, and an operand size bit-field
designating an operand byte-size;

a load-store Op format having a format bit-field designating the instruction
code as a load-store Op format code, a type bit-field designating a load-store
instruction type, a data bit-field identifying a destination-source of a load-store
operation, an index scale factor bit-field for designating an index scale factor,
a segment bit-field designating a segment register, a base bit-field designating
a load-store base address, a displacement bit-field designating a load-store
address displacement, and an index bit-field designating a load-store address
index.
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26. An instruction set according to Claim 25, wherein the first register Op format 
further includes: 

an extension bit-field for designating a condition code. 

27. An instruction set according to Claim 25, wherein a bit-field of the instruction
code is substituted into the Op format, the substituted bit-field being determined as a function of 
processor context. 

28. A superscalar microprocessor comprising:

a source of CISC-like instructions;

a RISC-like processor core for executing a plurality of RISC-like operations in parallel;
and

a decoder coupling the source of CISC-like instructions to the RISC-like processor
core, the decoder for converting CISC-like instructions into operations of a
RISC-like instruction set including:

a plurality of mutually-uniform bit-length instruction codes, each code being
divided into a plurality of defined-usage bit fields and the codes being
classified into a plurality of instruction classes, each code in an
instruction class having a mutually-consistent definition of
defined-usage bit-fields, the instruction classes including:

a register operation (RegOp) class including arithmetic operations,
shift operations and move operations and having
defined-usage bit-fields including an operation type field, three
operand bit-fields for designating a first source operand, a
second source operand and a destination operand, a bit-field
for designating a data size of the operands; and

a load-store operation (LdStOp) class including load and store
operations and having defined-usage bit-fields including an
operation type field, a plurality of bit-fields for designating a
load-store address in memory, a bit-field for designating a
data source-destination register for sourcing-receiving data
from the load-store address in memory, and a bit-field for
designating a data size of the source-destination data.

29. A microprocessor according to Claim 28, wherein the instruction set RegOp class
further includes an operation that modifies one or more status flags, the RegOp class further
having defined-usage bit-fields including:

an extension bit-field for specifying the one or more status flags that are modified by
the operation; and

a set status bit-field for causing the operation to modify status flags in accordance with
the extension bit-field.

30. A microprocessor according to Claim 28, wherein the instruction set further
includes a load immediate operation class (LIMMOp) having defined-usage bit-fields including:

an immediate data high (ImmHi) bit-field and an immediate data low (ImmLo) bit-field
for designating the immediate data value; and

a bit-field for designating a data destination register for receiving data from the
load-store address in memory.

31. A microprocessor according to Claim 28, wherein the instruction set further
includes a special operation class (SpecOp) including a conditional branch operation, a set 
default fault handler address operation, a set alternate fault handler address operation and an 
unconditional fault operation and having defined-usage bit-fields including: 

a bit-field for designating a condition code; and 

a data immediate bit-field for designating a signed immediate data value. 

32. A computer system comprising:

a memory subsystem which stores data and instructions; and

a processor operably coupled to access the data and instructions stored in the memory
subsystem, wherein the processor includes:

a source of CISC-like instructions;

a RISC-like processor core for executing a plurality of RISC-like operations in
parallel; and

a decoder coupling the source of CISC-like instructions to the RISC-like
processor core, the decoder for converting CISC-like instructions into
operations of a RISC-like instruction set including:

a plurality of mutually-uniform bit-length instruction codes, each code
being divided into a plurality of defined-usage bit fields and
the codes being classified into a plurality of instruction
classes, each code in an instruction class having a
mutually-consistent definition of defined-usage bit-fields, the
instruction classes including:

a register operation (RegOp) class including arithmetic
operations, shift operations and move operations and
having defined-usage bit-fields including an
operation type field, three operand bit-fields for
designating a first source operand, a second source
operand and a destination operand, a bit-field for
designating a data size of the operands; and

a load-store operation (LdStOp) class including load and store
operations and having defined-usage bit-fields
including an operation type field, a plurality of
bit-fields for designating a load-store address in
memory, a bit-field for designating a data
source-destination register for sourcing-receiving data from
the load-store address in memory, and a bit-field for
designating a data size of the source-destination data.

33. In a computer system having a memory subsystem for storing data and instructions
and a processor operably coupled to access the data and instructions stored in the memory
subsystem, a RISC-like processor core for executing a plurality of RISC-like operations in
parallel, the RISC-like operations being operations of a RISC-like instruction set converted from
a set of CISC-like instructions, the RISC-like instruction set comprising:

a plurality of mutually-uniform bit-length instruction codes, each code being divided
into a plurality of defined-usage bit fields and the codes being classified into a
plurality of instruction classes, the code in an instruction class having a
mutually-consistent definition of defined-usage bit-fields including a bit-field
that is mapped using indirect specifiers so that a single mutually-uniform
bit-length instruction code maps into a plurality of instruction versions.